Python: forward fill a column with an index-based limit

python pandas dataframe pandas-groupby imputation

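I want to forward fill a column with a limit, but I want the limit to be based on the index rather than on a simple count of rows, which is all that limit allows.

For example, suppose my DataFrame is given by:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})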

which looks like

In [27]: df
Out[27]:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
3   3.0      1
4   NaN      1
5   5.0      0
6   NaN      0
7   NaN      0
8   NaN      1
9   NaN      1

If I group by the group column and forward fill within each group with limit=2, the resulting DataFrame is

In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   3.0
9      1   NaN
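
To make the row-count behaviour concrete, here is a minimal sketch that rebuilds the frame above and applies the per-group forward fill to the data column only. Selecting the column explicitly is a choice made for this sketch so the result is a plain Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})

# limit=2 counts consecutive NaN rows *within each group*,
# regardless of how far apart their index labels are.
row_limited = df.groupby('group')['data'].ffill(limit=2)

print(row_limited.tolist())
# [0.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0, 3.0, nan]
```

Index 8 receives 3.0 even though it is 5 index positions away from index 3, because it is only the second NaN row of group 1.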

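However, what I actually want here is to forward fill only onto rows whose index is within 2 of the index of the last non-NaN value in the group, rather than onto the next 2 rows of each group. For example, if we just look at the groups of the DataFrame:

In [36]: for i, group in df.groupby('group'):
    ...:     print(group)
    ...:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
5   5.0      0
6   NaN      0
7   NaN      0
   data  group
3   3.0      1
4   NaN      1
8   NaN      1
9   NaN      1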

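I want the second group here to be forward filled only up to index 4, not onto indices 8 and 9. The NaN values in the first group are all within 2 indices of the last non-NaN value, so they are filled completely. The resulting DataFrame would look like this:

   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   NaN
9      1   NaN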

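FWIW, in my actual use case the index is a DatetimeIndex (and it is sorted).

I currently have a solution that loops over the DataFrame filtered on the group index, builds a time range from the index for every event with a non-NaN value, and then combines those ranges. But this is far too slow to be practical.
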
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
    'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})

df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
#    index  data  group  stop_index   mask
# 0      0   0.0      0         2.0   True
# 1      1   1.0      0         3.0   True
# 2      2   1.0      1         4.0   True
# 3      3   3.0      0         5.0   True
# 4      4   1.0      1         4.0   True
# 5      5  22.0      0         7.0   True
# 6      6   NaN      1         4.0  False
# 7      7   5.0      0         9.0   True
# 8      8   NaN      1         4.0  False
# 9      9   NaN      1         4.0  False

# clean up df
df = df[['data', 'group']]
print(df)


This copies the index into a column (reset_index), then creates a second column, stop_index, which is the index plus the size of the (time) window (2 here).

It then nulls out the rows of stop_index that correspond to null rows in data:

df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))

Then it forward fills stop_index on a per-group basis:

df['stop_index'] = df.groupby('group')['stop_index'].ffill()

Now (at last) we can define the desired mask, i.e. the places where we actually want to forward fill data:

df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
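
Since the question mentions a sorted DatetimeIndex, the same steps can be sketched with timestamps instead of integer positions. This is an illustrative adaptation, not code from the answer: the timestamps, the values, and the 2-day window are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data on an irregular, sorted DatetimeIndex
idx = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                      '2021-01-04', '2021-01-08', '2021-01-09'])
df = pd.DataFrame({
    'data': [1.0, 10.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 1, 0, 1, 0, 1],
}, index=idx)

window = pd.Timedelta(days=2)  # assumed window size

# Timestamp of every row, as a Series aligned with df
ts = df.index.to_series()

# Cut-off time, defined only where data was actually observed (NaT elsewhere)
stop = (ts + window).where(df['data'].notna())

# Carry each cut-off forward within its own group
stop = stop.groupby(df['group']).ffill()

# Rows at or before the cut-off may be filled; comparisons with NaT are False
mask = ts <= stop

# Unlimited per-group forward fill, kept only where the mask allows it
df['data'] = df.groupby('group')['data'].ffill().where(mask)

print(df['data'].tolist())
# [1.0, 10.0, 1.0, 10.0, nan, nan]
```

The rows on 2021-01-08 and 2021-01-09 stay NaN because they fall outside the 2-day window, whereas a plain ffill(limit=2) per group would have filled them.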

IIUC

Test data:

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   NaN      1
5  22.0      0
6   NaN      1
7   5.0      0
8   NaN      1
9   NaN      1

Result:

   data  group
0   0.0    0.0
1   1.0    0.0
2   1.0    1.0
3   3.0    0.0
4   1.0    1.0
5  22.0    0.0
6   NaN    1.0  # unchanged here, since the previous two index positions hold no valid value for group 1
7   5.0    0.0
8   NaN    1.0
9   NaN    1.0
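
The code for this answer did not survive in the text above. Going by the description in the comments (reindex each group to the full index, then apply an ordinary limited ffill), a rough reconstruction could look like the sketch below. The variable names and the exact chain are assumptions and may differ from what the answerer wrote.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
    'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})

pieces = []
for _, sub in df.groupby('group'):
    # Reindexing to the full index inserts NaN at every position that
    # belongs to another group, so limit=2 now counts index positions.
    filled = sub['data'].reindex(df.index).ffill(limit=2)
    # Keep only the rows that really belong to this group
    pieces.append(filled.loc[sub.index])

result = df.copy()
result['data'] = pd.concat(pieces).sort_index()

print(result['data'].tolist())
# [0.0, 1.0, 1.0, 3.0, 1.0, 22.0, nan, 5.0, nan, nan]
```

This only measures distance correctly when the full index is evenly spaced, because limit still counts rows of the reindexed frame.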

Issue with unutbu's answer:

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   1.0      1
5  22.0      0
6   1.0      1  # mismatch here
7   5.0      0
8   NaN      1
9   NaN      1

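Hi, would you like to test with the test data in my answer? I can't match the output..

I was thinking reindex might be involved in a solution. Can you explain the method chain a bit?

@AlexanderReynolds Reindex the sub df's index to the original df's index; all rows that do not appear in the sub df will be NaN, and then we just need a normal ffill with limit, since the index is continuous after reindexing.

That makes a lot of sense! Yes, you highlighted the problem with the other answer and understood me correctly. I'll give it a bit of time to see whether there is a way to do it without explicitly using the index inside the groupby. By the way, you can use df.groupby(...).groups, which is a dictionary whose values are the indices, instead of pulling them out manually with group.index. So: for idx in df.groupby(...).groups.values().

Hmm... actually, on second thought, I don't think this is the answer either. The problem is that this still counts the number of rows, rather than cutting off at an arbitrary distance measured in the values of the index, no?

@AlexanderReynolds Could you run the method on some edge cases to see whether it works?

It would help to modify the example to use a DatetimeIndex, and to design values that actually exercise all the conditions you want and yield a valid answer.

@unutbu Indeed, it is hard to mock up properly. But I agree such an example would be more useful to future readers. I wanted to state it more generally, but that just added confusion for limited benefit (if any).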
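
For reference, the groups attribute mentioned in the comments can be inspected with a small sketch; the frame below is made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data': [0.0, 1.0, 1.0, 3.0], 'group': [0, 0, 1, 0]})

# .groups maps each group key to the Index of row labels in that group
groups = df.groupby('group').groups

for key, idx in groups.items():
    print(key, idx.tolist())
# 0 [0, 1, 3]
# 1 [2]
```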