Python: how to forward-fill within a groupby using a date limit
I am working with the following DataFrame, which contains some NaN values:
import numpy as np
import pandas as pd

# pd.datetime is no longer available in recent pandas; pd.to_datetime builds the same dates
df = pd.DataFrame({'day': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-08', '2020-06-09']),
                   'TradeID': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
                   'Security': ['GOOGLE', 'GOOGLE', 'APPLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE'],
                   'ID': ['ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001'],
                   'BSType': ['B', 'S', 'B', 'B', 'B', 'S', 'S', 'S', 'B'],
                   'Price': [105.901, 106.969, np.nan, 107.037, 107.038, 107.136, np.nan, 107.25, np.nan],
                   'Quantity': [1000000, -300000, np.nan, 7500000, 100000, -100000, np.nan, -7800000, np.nan]
                   })
Out[318]:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S NaN NaN
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
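My goal is to fill the NaN values with the ffill method, only within the same Security and the same ID, and with a limit of the next 60 days (not the next 60 observations, since there can be more than one observation per day).
This is what I have tried, but it does not work; it does not replace any of my values: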
df=df.groupby(['day',"Security","ID"], as_index=False).fillna(method='ffill',limit=60)
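In the expected output, only the second pair of NaN values is filled:
- The first pair of NaN values should not be filled, because the Security is different
- The second pair of NaN values should be filled with the previous observation
- The third pair of NaN values should not be filled, because it falls outside the 60-day window
Thank you very much for your time.
Here is my attempt, though I am not sure it is particularly scalable: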
filled_df = df.groupby(["Security","ID"], as_index=False).fillna(method='ffill')
diffs = df.groupby(["Security","ID"])["day"].diff().dt.days
df["diffs"] = diffs
df["price_isna"] = df["Price"].isna()
df["quantity_isna"] = df["Quantity"].isna()
df = df.drop(columns=["Price", "Quantity"]).merge(filled_df, on=["day", "TradeID", "BSType"])
def reverse_fillna(value, value_isna, diffs, time_limit=60):
    # Keep original values, and keep filled values only when the gap is within the limit
    if (value_isna and (diffs <= time_limit)) or (not value_isna):
        return value
    else:
        return np.nan
df['Price'] = df.apply(lambda row: reverse_fillna(row['Price'], row['price_isna'], row['diffs']), axis=1)
df['Quantity'] = df.apply(lambda row: reverse_fillna(row['Quantity'], row['quantity_isna'], row['diffs']), axis=1)
df.drop(columns=["price_isna", "quantity_isna", "diffs"], inplace=True)
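If the row-wise apply becomes a bottleneck, a vectorized variant of the same idea might look like the sketch below. It is only a sketch under some assumptions: ffill_within_days is a hypothetical helper name (not a pandas function), and instead of the gap between consecutive rows it measures the gap back to the last non-missing observation in the same group, which is one reading of the 60-day requirement.

```python
import numpy as np
import pandas as pd

def ffill_within_days(df, keys, value_cols, day_col="day", max_days=60):
    """Forward-fill value_cols per group, but only when the last non-missing
    observation in that group is at most max_days days old.
    Hypothetical helper, not part of pandas."""
    out = df.sort_values(day_col, kind="stable").copy()
    group_keys = [out[k] for k in keys]
    limit = pd.Timedelta(days=max_days)
    for col in value_cols:
        # Date of the most recent row in the same group that had a value.
        last_seen = out[day_col].where(out[col].notna()).groupby(group_keys).ffill()
        filled = out.groupby(keys)[col].ffill()
        # Keep the filled value only where that observation is recent enough.
        out[col] = filled.where((out[day_col] - last_seen) <= limit)
    return out.sort_index()

# Same data as in the question, reduced to the relevant columns.
sample = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04",
                           "2020-01-05", "2020-01-06", "2020-01-07",
                           "2020-01-08", "2020-01-08", "2020-06-09"]),
    "Security": ["GOOGLE", "GOOGLE", "APPLE", "GOOGLE", "GOOGLE",
                 "GOOGLE", "GOOGLE", "GOOGLE", "GOOGLE"],
    "ID": ["ID001"] * 9,
    "Price": [105.901, 106.969, np.nan, 107.037, 107.038,
              107.136, np.nan, 107.25, np.nan],
    "Quantity": [1000000, -300000, np.nan, 7500000, 100000,
                 -100000, np.nan, -7800000, np.nan],
})

result = ffill_within_days(sample, ["Security", "ID"], ["Price", "Quantity"])
print(result)
```

On the question's data this fills only row 6 (from row 5), leaves the APPLE row alone because it has no earlier observation, and leaves the 2020-06-09 row as NaN because the last GOOGLE value is more than 60 days old.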
You can group the DataFrame on the columns Security and ID, together with an additional pd.Grouper on the column day with its frequency set to 60 days, then use ffill to forward-fill the values for the next 60 days:
g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())
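Hi, thank you very much for this useful tip. It works very well. What if I want to restrict this command to a single column, i.e. ffill the NaN values under this condition in one column only? I have tried df.assign(**df.groupby(["Security","ID", g])["Quantity"].ffill()) but it does not work.
@GuillermoCambroneroPérez If you want to forward-fill a specific column, you can use df.assign(**df.groupby(["Security","ID", g])[['Quantity']].ffill()) or df.assign(**df.groupby(["Security","ID", g])['Quantity'].ffill().to_frame())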
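To make the single-column variant from the comments concrete, here is a minimal sketch on a three-row excerpt of the question's data (the excerpt and the name only_qty are just for illustration). As far as I understand it, the plain Series version fails because unpacking a Series with ** walks over its index labels rather than column names, whereas a one-column DataFrame unpacks to a "Quantity" keyword.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-07", "2020-01-08", "2020-01-08"]),
    "Security": ["GOOGLE"] * 3,
    "ID": ["ID001"] * 3,
    "Price": [107.136, np.nan, 107.25],
    "Quantity": [-100000, np.nan, -7800000],
})

g = pd.Grouper(key="day", freq="60D")

# Selecting with a list keeps a one-column DataFrame, so ** passes
# Quantity=<filled Series> to assign and leaves Price untouched.
only_qty = df.assign(**df.groupby(["Security", "ID", g])[["Quantity"]].ffill())
print(only_qty)
```

A plain column assignment such as df["Quantity"] = df.groupby(["Security", "ID", g])["Quantity"].ffill() should give the same values if modifying df in place is acceptable.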
Output of the Grouper-based fill above:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
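One caveat about the pd.Grouper approach that is worth checking on your own data: as far as I can tell, freq='60D' cuts the timeline into fixed 60-day bins anchored at the first day in the data, rather than looking back 60 days from each row. A value and a NaN only a few days apart can therefore land in different bins. The small frame below is made up purely to illustrate this; with dates starting on 2020-01-01 the first bin should end on 2020-03-01.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-02-28", "2020-03-02"]),
    "Security": ["GOOGLE"] * 3,
    "Price": [100.0, 101.0, np.nan],
})

g = pd.Grouper(key="day", freq="60D")

# Fixed 60-day bins: 2020-02-28 and 2020-03-02 fall into different bins,
# so the NaN has no earlier value inside its own group.
binned = demo.groupby(["Security", g])["Price"].ffill()

# Plain per-security ffill for comparison (no date limit at all).
plain = demo.groupby("Security")["Price"].ffill()

print(pd.DataFrame({"day": demo["day"], "binned": binned, "plain": plain}))
```

For the data in the question this makes no difference, since every January row sits in the first bin and the June row sits in a later one, but it can matter when observations straddle a bin edge.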