Python: fill missing data with the previous non-missing value, grouped by key

I am working with a DataFrame like this:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN
I would like to replace each NaN in "x" with the previous non-NaN "x" from a row that has the same "id" value:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1   20
5   2  200
6   1  300
7   1  300
Is there a slick way to do this without manually looping over the rows?

You can group by "id" and forward-fill within each group:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)
which yields

   id      x
0   1   10.0
1   1   20.0
2   2  100.0
3   2  200.0
4   1   20.0
5   2  200.0
6   1  300.0
7   1  300.0
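
To see why the groupby matters, here is a small sketch that compares a plain forward fill with the grouped one on the same data. The names `plain`, `grouped`, and `comparison` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 1, 2, 1, 1],
    'x': [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Plain forward fill ignores 'id': row 4 (id 1) inherits 200 from row 3 (id 2).
plain = df['x'].ffill()

# Grouped forward fill only looks back within the same 'id': row 4 gets 20.
grouped = df.groupby('id')['x'].ffill()

comparison = pd.DataFrame({'id': df['id'], 'plain': plain, 'grouped': grouped})
print(comparison)
```

The two columns differ only at row 4, which is the one NaN whose immediately preceding row belongs to a different id.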
Use sort_values, groupby, and ffill so that if the first value of a group, or its first several values, are NaN, they get filled as well.
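
A minimal sketch of that combination follows. It assumes the sort is on "x", which the answer does not state, and it uses made-up data in which id 1 starts with a NaN. Because sort_values places NaN last by default, the sorted fill gives every NaN the largest "x" in its group rather than the previous value in row order. For comparison, the sketch also shows a grouped ffill followed by a grouped bfill, which keeps the row order.

```python
import numpy as np
import pandas as pd

# Hypothetical data: id 1 has a NaN first, so there is nothing before it to carry forward.
df = pd.DataFrame({'id': [1, 1, 1, 1, 2], 'x': [np.nan, 20, 10, np.nan, 100]})

# Grouped ffill alone leaves the leading NaN in place.
only_ffill = df.groupby('id')['x'].ffill()

# Assumed reading of the answer: sort on 'x' (NaN rows go last), ffill per group,
# then restore the original row order.
sorted_fill = df.sort_values('x').groupby('id')['x'].ffill().sort_index()

# Alternative that keeps row order: ffill per group, then bfill per group.
ffill_bfill = only_ffill.groupby(df['id']).bfill()

print(pd.DataFrame({
    'id': df['id'],
    'only_ffill': only_ffill,
    'sorted_fill': sorted_fill,
    'ffill_bfill': ffill_bfill,
}))
```

The two approaches disagree at row 3: the sorted fill gives 20 (the group maximum), while ffill followed by bfill gives 10 (the previous value in row order).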

A solution for the multi-key case: in this example, the data has the keys [date, region, type]. Date is the index of the original DataFrame.

import os
import pandas as pd

#sort to make indexing faster
df.sort_values(by=['date','region','type'], inplace=True)

#collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))

#record column names
df_cols = df.columns

#delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass

# steps:
# 1) grab rows with a particular region and type
# 2) use forwardfill to fill nulls
# 3) use backwardfill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df.ffill(inplace=True)  # forward-fill nulls within this group
        group_df.bfill(inplace=True)  # back-fill nulls left at the start of the group
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True) 
Check the result:

#load in the ffill_df
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = ['date'] + list(df_cols)  # first CSV column is the saved 'date' index
ffill_df.index= ffill_df.date
ffill_df.drop('date', axis=1, inplace=True)
ffill_df.head()

#compare new and old dataframe
print(df.shape)        
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
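
The per-group loop and the temporary CSV file can probably be avoided by grouping on both keys at once. The following is a sketch under assumptions, not the answer's own code: the data and the single "value" column are made up, and "date" is moved into a regular column so that row labels are unique while filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'date' is the index; 'region' and 'type' are the other keys.
dates = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'] * 2, name='date')
df = pd.DataFrame(
    {
        'region': ['east'] * 3 + ['west'] * 3,
        'type': ['a'] * 3 + ['b'] * 3,
        'value': [np.nan, 1.0, np.nan, 5.0, np.nan, 7.0],
    },
    index=dates,
)

keys = ['region', 'type']

# Move 'date' into a column so row labels are unique, then order rows by date within each group.
filled = df.reset_index().sort_values(keys + ['date'])

# Forward-fill within each (region, type) group, then back-fill what is still missing.
filled['value'] = filled.groupby(keys)['value'].ffill()
filled['value'] = filled.groupby(keys)['value'].bfill()

filled = filled.set_index('date')
print(filled)
print(filled.isnull().sum())
```

A group whose values are all NaN would stay NaN under either approach, since there is nothing to fill from.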

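The "ffill" option is exactly what I needed. Thanks.
You can also use
df['x'] = df.groupby('id').fillna(method='ffill')
to achieve the same thing with slightly simpler syntax.
@Zhang18: Thanks for the improvement.
df.groupby(['id']).ffill()
works too.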