Python: Delete values conditional on large values of another column
Tags: python, pandas, loops, time-series


I have a time series df containing a daily rate in column A and the relative change from one day to the next in column B.

The DF looks like this:

                   IR      Shift
May/24/2019        5.9%    - 
May/25/2019        6%      1.67%      
May/26/2019        5.9%    -1.67
May/27/2019        20.2%   292%
May/28/2019        20.5%   1.4% 
May/29/2019        20%    -1.6% 
May/30/2019        5.1%   -292%
May/31/2019        5.1%     0%

I want to remove all values in column A that occur between large relative shifts (beyond +/-50%).

So the DF above should look like this:

                      IR      Shift
May/24/2019        5.9%    - 
May/25/2019        6%       1.67%      
May/26/2019        5.9%    -1.67
May/27/2019        np.nan   292%
May/28/2019        np.nan   1.4% 
May/29/2019        np.nan  -1.6% 
May/30/2019        5.1%    -292%
May/31/2019        5.1%      0%

This is what I have so far... Thanks for your help.

 for i, j in df1.iterrows():
      if df1['Shift'][i] > .50 :
          x = df1['IR'][i]
      if df1['Shift'][j] < -.50 :
          y = df1['IR'][j]
      df1['IR'] = np.where(df1['Shift'].between(x,y), df1['Shift'], 
      np.nan)                                                                                                                                  

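Error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

There is probably a more "proper" way to do this, but I'm not familiar with all of the built-in functionality.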

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

>>>df
        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27  0.202  2.9200
4 2019-05-28  0.205  0.0140
5 2019-05-29  0.200 -0.0160
6 2019-05-30  0.051 -2.9200

# Blank IR wherever Shift differs from the previous row's Shift by more than 0.5
df['IR'] = [np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27    NaN  2.9200
4 2019-05-28    NaN  0.0140
5 2019-05-29  0.200 -0.0160
6 2019-05-30    NaN -2.9200
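
For readers who find the list comprehension hard to follow, the same comparison can be sketched with vectorized pandas calls. This is an illustration rather than part of the original answer, and the names `jump` and `IR_masked` are made up here. It reproduces the result shown above, so row 5 is still kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
    'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92],
})

# Absolute difference between each Shift and the previous row's Shift
jump = df['Shift'].diff().abs()

# Series.mask puts NaN where the condition is True.
# Comparisons against NaN are False, so the first two rows are kept.
df['IR_masked'] = df['IR'].mask(jump > 0.5)
print(df)
```

Writing to a new column instead of overwriting `IR` makes it easier to compare the two side by side.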

You can also use the np.where function from numpy, as follows:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)

In [8]: df                                                                                                                                                                                                                               
Out[8]: 
        Date      IR   Shift
0 2019-05-24     NaN     NaN
1 2019-05-25  0.0167  0.0167
2 2019-05-26     NaN -0.0167
3 2019-05-27  2.9200  2.9200
4 2019-05-28  0.0140  0.0140
5 2019-05-29     NaN -0.0160
6 2019-05-30     NaN -2.9200
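
The output above keeps only the positive Shift values and writes them into IR. A short check, added here purely as an illustration, suggests why: comparing a value against 0.5 and 1.5 times itself succeeds only when the value is positive.

```python
import numpy as np
import pandas as pd

shift = pd.Series([np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92])

# For s > 0:  0.5*s <= s <= 1.5*s holds, so the row passes.
# For s < 0:  0.5*s is greater than s, so the lower bound fails.
# For NaN:    every comparison is False.
inside = shift.between(shift * 0.5, shift * 1.5)
print(inside.tolist())

# The mask is the same as a plain sign test
print((shift > 0).tolist())
```

So this condition does not test for a shift of more than 50%, which is consistent with the feedback in the comments.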

Use df.at to access a single value for a row/column label pair.

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30),datetime(2019,5,31)], 'IR':[5.9,6,5.9,20.2, 20.5, 20, 5.1, 5.1], 'Shift':[np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})

print("DataFrame Before :")
print(df)

count = 1
while (count < len(df.index)):
    if (abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50):
        df.at[count, 'IR'] = np.nan
    count = count + 1

print("DataFrame After :")
print(df)

Based on your description of triggering this on any large shift (positive or negative), you can do the following:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan

        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27    NaN  2.9200
4 2019-05-28    NaN  0.0140
5 2019-05-29    NaN -0.0160
6 2019-05-30  0.051 -2.9200
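
Steps:

  • abs(df.Shift) > .5: finds the shifts larger than +/-50%

  • .cumsum(): gives each period a unique value, where the odd periods are the ones we want to ignore

  • % 2 == 1: checks which rows have an odd cumsum() value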
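
The steps above can be sketched with each intermediate result kept in its own variable. The names `big`, `period`, and `mask` are chosen here for illustration and do not appear in the answer.

```python
import numpy as np
import pandas as pd

shift = pd.Series([np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92])

big = shift.abs() > 0.5     # True on rows 3 and 6 (NaN compares as False)
period = big.cumsum()       # running count of large shifts: 0, 0, 0, 1, 1, 1, 2
mask = period % 2 == 1      # True on rows 3, 4, 5

print(pd.DataFrame({'Shift': shift, 'big': big, 'period': period, 'mask': mask}))
```

The second large shift bumps the count to an even number, which is why row 6 itself is not masked.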


Note: this does not work if you want to require that every positive spike be followed by a negative spike, or vice versa.
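
If that stricter rule is needed, one possible approach is a small explicit loop that remembers the direction of the spike that opened the period. The following is only a sketch under that assumption; the function `mask_between_opposite_spikes` and its `threshold` parameter are hypothetical names, not something from the answer or from pandas.

```python
import numpy as np
import pandas as pd

def mask_between_opposite_spikes(ir, shift, threshold=0.5):
    """Blank IR from a large shift until a large shift in the opposite direction."""
    out = ir.astype(float).copy()
    open_sign = 0  # 0: not inside a spike; +1 / -1: direction that opened it
    for i, s in enumerate(shift.to_numpy()):
        is_big = abs(s) > threshold  # False when s is NaN
        if open_sign == 0:
            if is_big:
                open_sign = 1 if s > 0 else -1
                out.iloc[i] = np.nan
        elif is_big and np.sign(s) == -open_sign:
            open_sign = 0  # the opposite spike closes the period; this row is kept
        else:
            out.iloc[i] = np.nan
    return out

ir = pd.Series([5.9, 6.0, 5.9, 20.2, 20.5, 20.0, 5.1, 5.1])
shift = pd.Series([np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0])
print(mask_between_opposite_spikes(ir, shift, threshold=50))
```

Unlike the cumsum() version, a second large shift in the same direction does not end the period here; only a large move the other way does.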

I wasn't sure about your Shift column, so I calculated it again. Does this work for you?

import pandas as pd
import numpy as np

df.drop(columns=['Shift'], inplace=True)  ## calculated via method below
df['nextval'] = df['IR'].shift(periods=1)

def shift(current, previous):
    return (current-previous)/previous * 100

indexlist=[]  ## to save index that will be set to null
prior=0  ## temporary flag to store value prior to a peak 
flag=False

for index, row in df.iterrows():    
    if index==0: ## to skip first row of data
        continue

    if not flag and (shift(row['IR'], row['nextval'])) > 50:   ## to check for start of peak
        prior=row['nextval']
        indexlist.append(index)
        flag=True
        continue

    if flag:  ## checking until when the peak lasts
        if (shift(row['IR'], prior)) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist),'IR'] = np.nan ## replacing with nan
print(df)
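
The day-over-day percentage change computed by the helper above can also be obtained for the whole column at once with `Series.pct_change`. This is a side sketch, not part of the answer; `shift_pct` is a name made up for it.

```python
import numpy as np
import pandas as pd

ir = pd.Series([5.9, 6.0, 5.9, 20.2, 20.5, 20.0, 5.1, 5.1])

# (current - previous) / previous * 100 for every row; the first row is NaN
shift_pct = ir.pct_change() * 100
print(shift_pct.round(2))
```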


We can locate the rows that fall between pairs of outliers ([1st-2nd], [3rd-4th], ...) and then mask the whole DataFrame at once.

Here I have added a few more rows to show how this behaves when there are multiple spikes. IR_modified shows how IR is masked with the logic above.

               IR   Shift  IR_modified
May/24/2019   5.9     NaN          5.9
May/25/2019   6.0    1.67          6.0
May/26/2019   5.9   -1.67          5.9
May/27/2019  20.2  292.00          NaN
May/28/2019  20.5    1.40          NaN
May/29/2019  20.0   -1.60          NaN
May/30/2019   5.1 -292.00          5.1
May/31/2019   5.1    0.00          5.1
June/1/2019   7.0  415.00          NaN
June/2/2019  17.0   15.00          NaN
June/3/2019  27.0   12.00          NaN
June/4/2019  17.0  315.00         17.0
June/5/2019   7.0  -12.00          7.0
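
As a rough sketch, the IR_modified column in the table above can be reproduced as follows. The data is typed in by hand from the table, and `Series.mask` is used here so that the original IR column is left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IR':    [5.9, 6.0, 5.9, 20.2, 20.5, 20.0, 5.1, 5.1, 7.0, 17.0, 27.0, 17.0, 7.0],
    'Shift': [np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0, 415, 15, 12, 315, -12],
})

# Rows whose Shift exceeds +/-50
s = df['Shift'].abs().gt(50)

# An odd running count means the row lies between the 1st-2nd, 3rd-4th, ... outlier
m = s.cumsum() % 2 == 1

# Series.mask returns a copy with NaN wherever m is True
df['IR_modified'] = df['IR'].mask(m)
print(df)
```

Note that the second spike (415 followed by 315) is closed by another positive shift, since this logic looks only at magnitude and not at direction.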

df.loc[df['Shift']>0.5,'IR']=np.nan

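Comments:

OK, thanks for the tip, I'm new to programming. How would I go about trying...?

@yatu, I'm not clear on what you mean by "keep a temporary variable, set to nan whenever the change relative to the last valid sample is > 50%. Compare the current sample with the last valid value". Can you give an example?

What is the relative shift?

@rprakash, the change in IR (column A) from one day to the next.

@ALollz, indeed, you are right. In fact, in my time series the spikes (>50) occur throughout the data, so all values between those large shifts need to be removed. I will definitely try your code once I get home. It seems you understand the problem. Thanks.

Let me digest the above. The first run didn't seem to have any effect.

Perhaps someone who is new to coding would find a regular loop easier than this unreadable list comprehension?

Thanks Olel. No dice. The code above removed the data points where the relative change was below .50. Also, the main goal here is to remove a block of data in the time series where the rate spikes up for a while and then drops back by almost the same relative amount as the initial spike.

The IR in row 5 of your "DataFrame After" should also be NaN, while rows 6 and 7 should not be NaN. Imagine a stock that surges after an announcement and then falls back to normal levels a few months later. The goal here is to remove the prices that occur during the elevated period... so rows 6 and 7 represent the normal period... rows 3/4/5 are the elevated period...

The rise in IR and the return to "normal" both happen with a similar magnitude, 292%, though in opposite directions. So how can we set up the code to trigger when a move exceeds 50% and stop when a similar but opposite move of -50% occurs, converting all values in between to NaN?

So I'm not sure about the way you chose to determine whether 50% is exceeded. The Shift column tells us what the change is from one day to the next... and after the "spike", the daily shift is small until IR drops...

Please provide some explanation with your answer.
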
          date   IR  nextval
0  May/24/2019  5.9      NaN
1  May/25/2019  6.0      5.9
2  May/26/2019  5.9      6.0
3  May/27/2019  NaN      5.9
4  May/28/2019  NaN     20.2
5  May/29/2019  NaN     20.5
6  May/30/2019  5.1     20.0
7  May/31/2019  5.1      5.1
import pandas as pd
import numpy as np

df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))

               IR   Shift
May/24/2019   5.9     NaN
May/25/2019   6.0    1.67
May/26/2019   5.9   -1.67
May/27/2019  20.2  292.00
May/28/2019  20.5    1.40
May/29/2019  20.0   -1.60
May/30/2019   5.1 -292.00
May/31/2019   5.1    0.00
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)

# Get the indices between consecutive pairs. 
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1

df.loc[m, 'IR'] = np.nan
#              IR   Shift
#May/24/2019  5.9     NaN
#May/25/2019  6.0    1.67
#May/26/2019  5.9   -1.67
#May/27/2019  NaN  292.00
#May/28/2019  NaN    1.40
#May/29/2019  NaN   -1.60
#May/30/2019  5.1 -292.00
#May/31/2019  5.1    0.00