Python: forward fill (ffill) based on group and previous row


I have a large DataFrame (over 400,000 rows) that looks like this:

import numpy as np
import pandas as pd

# A plain list of lists keeps NaN as a real missing value; wrapping the rows
# in np.array would coerce every cell, including NaN, to a string.
data = [
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,     np.nan],
    [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15,     21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,     np.nan],
    [3212, '01/01/2018', 21,     17,     '31/11/2017'],
]
columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame(data, columns=columns)

Resulting df:

     id     ReceivedDate    PropertyType    MeterType   VisitDate
0   1949    01/01/2018       NaN              17       30/11/2017
1   1949    01/01/2018       NaN              19       NaN
2   1811    01/01/2018       16              NaN       31/11/2017
3   1949    01/01/2018       15               21       01/12/2017
4   1949    01/01/2018       NaN              20       NaN
5   3212    01/01/2018       21               17       31/11/2017

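I want to forward fill within each group (grouped by id and ReceivedDate), but only when the rows are consecutive in the index (i.e., forward fill only index positions 1 and 4).

I was thinking of adding a column that tells me whether a row should be filled according to this criterion, but how can I check the row above?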
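One way to check the row above is to compare each row with a shifted copy of the key columns. The snippet below is a minimal sketch of that idea, not a full solution: the column name `should_fill` is hypothetical, and only `VisitDate` is inspected to keep the example short.

```python
import numpy as np
import pandas as pd

# Reduced version of the question's data (key columns plus one column to fill).
df = pd.DataFrame({
    'id': [1949, 1949, 1811, 1949, 1949, 3212],
    'ReceivedDate': ['01/01/2018'] * 6,
    'VisitDate': ['30/11/2017', np.nan, '31/11/2017',
                  '01/12/2017', np.nan, '31/11/2017'],
})

keys = df[['id', 'ReceivedDate']]

# True where the row directly above has the same id and ReceivedDate.
# shift() moves every row down by one, so row n is compared with row n - 1.
same_as_above = keys.eq(keys.shift()).all(axis=1)

# Hypothetical flag column: the row is empty and the row above is in the same group.
df['should_fill'] = same_as_above & df['VisitDate'].isna()

print(df)
```

On this data the flag is set only for index positions 1 and 4, which are the rows the question wants filled.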

(I was planning to use a solution along the lines of the following answer:

df.isnull().astype(int).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)

because x = df.groupby(['id', 'ReceivedDate']).ffill() is very slow.)

Desired df:
     id     ReceivedDate    PropertyType    MeterType   VisitDate
0   1949    01/01/2018       NaN              17       30/11/2017
1   1949    01/01/2018       NaN              19       30/11/2017
2   1811    01/01/2018       16              NaN       31/11/2017
3   1949    01/01/2018       15               21       01/12/2017
4   1949    01/01/2018       15               20       01/12/2017
5   3212    01/01/2018       21               17       31/11/2017
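To make the "consecutive rows only" requirement concrete, the sketch below labels each run of adjacent rows that share the same id and ReceivedDate and forward fills inside each run. The names `run_id` and `result` are illustrative, and filling only PropertyType and VisitDate is an assumption based on the desired output. It is meant to pin down the expected result; it still relies on a grouped ffill, so it is not claimed to address the speed concern.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
        [1949, '01/01/2018', np.nan, 19,     np.nan],
        [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
        [1949, '01/01/2018', 15,     21,     '01/12/2017'],
        [1949, '01/01/2018', np.nan, 20,     np.nan],
        [3212, '01/01/2018', 21,     17,     '31/11/2017'],
    ],
    columns=['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate'],
)

keys = df[['id', 'ReceivedDate']]

# A new run starts whenever any key column differs from the row above.
# The cumulative sum turns those change points into run labels: 1, 1, 2, 3, 3, 4.
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# Assumption: only these two columns should be filled.
cols_to_ffill = ['PropertyType', 'VisitDate']

result = df.copy()
# Forward fill inside each run, so values never cross into a run
# that belongs to a different (or non-adjacent) group.
result[cols_to_ffill] = df.groupby(run_id)[cols_to_ffill].ffill()

print(result)
```

Grouping by the run label rather than by id and ReceivedDate directly is what keeps row 3 from being filled using row 1, even though both have id 1949.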

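groupby and ffill with limit=1

df.groupby(['id', 'ReceivedDate']).ffill(limit=1)?

Sometimes there can be two such rows in a row, and I'm trying to avoid df.groupby.ffill because it takes about 1 second per 1,000 rows (too slow).

But since you are limiting the number of forward fills, might it be faster?

Unfortunately not by enough. I just tested on 10,000 rows: ffill() = 11.2 s, ffill(limit=1) = 11.1 s.

groupby with masking and shift

Try using groupby, mask, and shift:

i = df[['id', 'ReceivedDate']]
# Label each run of consecutive rows that share the same id and ReceivedDate
j = i.ne(i.shift().values).any(axis=1).cumsum()

# Where the running NaN count within a run equals 1, take the value from the row above in that run
df.mask(df.isnull().astype(int).groupby(j).cumsum().eq(1), df.groupby(j).shift())

Or,

df.where(df.isnull().astype(int).groupby(j).cumsum().ne(1), df.groupby(j).shift())

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16        18  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017

It even forward fills row 2, where there is a NaN (which I don't want). I will update the question to show this "edge case".

@AH Added an explanation for you. I think this should work, but I'm not 100% sure about performance.

So the code doesn't check whether the previous row has the same id and ReceivedDate.

I checked it, and it gave me some ideas (it didn't fully answer my question, but that's because my question wasn't good enough the first time; sorry about that). Thanks.

So, is the answer wrong? Why doesn't it work; can you help me understand? If it isn't useful, I'd rather delete it. Also, how does it do in terms of timing?

Keep looping until there are no more matches (i.e., until all columns are forward filled):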

cols_to_ffill = ['PropertyType', 'VisitDate']
i = df.copy()

while True:
    # Key values of the row directly above
    RowAboveid = i.id.shift()
    RowAboveRD = i.ReceivedDate.shift()
    # Candidate rows: every column to fill is empty and the row above has the same id and ReceivedDate
    rows_with_cols_to_ffill_all_empty = i.loc[:, cols_to_ffill].isnull().all(axis=1)
    rows_to_ffill = (i.ReceivedDate == RowAboveRD) & (i.id == RowAboveid) & rows_with_cols_to_ffill_all_empty

    # Values from the row above, aligned to the rows being filled
    newdata = i[cols_to_ffill].shift()[rows_to_ffill]
    # Stop once no candidate row has anything to copy; this also prevents an
    # endless loop when the row above is itself empty
    if not newdata.notnull().values.any():
        break
    i.loc[rows_to_ffill, cols_to_ffill] = newdata
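A possible alternative to repeating the pass above is to fill each column once with a plain ffill and then discard any fill whose source value came from a different run of rows. This is only a sketch: the helper name `ffill_within_runs` and the variable names are hypothetical, and it fills column by column, so it does not require every target column in a row to be empty as the loop does. Its speed on 400,000 rows has not been measured here.

```python
import numpy as np
import pandas as pd


def ffill_within_runs(df, key_cols, cols_to_ffill):
    """Forward fill cols_to_ffill, but only from rows in the same consecutive run of key_cols."""
    keys = df[key_cols]
    # Label runs of adjacent rows with identical keys.
    run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

    out = df.copy()
    for col in cols_to_ffill:
        # Run label of the nearest row at or above that holds a value in this column.
        source_run = run_id.where(df[col].notna()).ffill()
        # Ungrouped forward fill, which may pull values across run boundaries.
        filled = df[col].ffill()
        # Keep a filled value only if it originated in the same run; otherwise keep the original.
        out[col] = filled.where(source_run.eq(run_id), df[col])
    return out


df = pd.DataFrame(
    [
        [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
        [1949, '01/01/2018', np.nan, 19,     np.nan],
        [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
        [1949, '01/01/2018', 15,     21,     '01/12/2017'],
        [1949, '01/01/2018', np.nan, 20,     np.nan],
        [3212, '01/01/2018', 21,     17,     '31/11/2017'],
    ],
    columns=['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate'],
)

result = ffill_within_runs(df, ['id', 'ReceivedDate'], ['PropertyType', 'VisitDate'])
print(result)
```

The remaining Python loop runs over the columns to fill, not over rows, and no grouped ffill is involved, which is the part reported as slow in the comments.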