python - Pandas: groupby ffill for multiple columns
I have the following DataFrame, which has some missing values. I want to use ffill() to fill the missing values in var1 and var2, grouped by date and building. I can do this for one variable at a time, but when I try to do both at once, it fails. How can I do this for both variables at the same time, while keeping var3 and var4 unmodified?
import numpy as np
import pandas as pd

df = pd.DataFrame({
'date': ['2019-01-01','2019-01-01','2019-01-01','2019-01-01','2019-02-01','2019-02-01','2019-02-01','2019-02-01'],
'building': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
'var1': [1.5, np.nan, 2.1, 2.2, 1.2, 1.3, 2.4, np.nan],
'var2': [100, 110, 105, np.nan, 102, np.nan, 103, 107],
'var3': [10, 11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'var4': [1, 2, 3, 4, 5, 6, 7, 8]
})
df
date building var1 var2 var3 var4
0 2019-01-01 a 1.5 100.0 10.0 1
1 2019-01-01 a NaN 110.0 11.0 2
2 2019-01-01 b 2.1 105.0 NaN 3
3 2019-01-01 b 2.2 NaN NaN 4
4 2019-02-01 a 1.2 102.0 NaN 5
5 2019-02-01 a 1.3 NaN NaN 6
6 2019-02-01 b 2.4 103.0 NaN 7
7 2019-02-01 b NaN 107.0 NaN 8
# This works
df['var1'] = df.groupby(['date', 'building'])['var1'].ffill()
df['var2'] = df.groupby(['date', 'building'])['var2'].ffill()
df
date building var1 var2 var3 var4
0 2019-01-01 a 1.5 100.0 10.0 1
1 2019-01-01 a 1.5 110.0 11.0 2
2 2019-01-01 b 2.1 105.0 NaN 3
3 2019-01-01 b 2.2 105.0 NaN 4
4 2019-02-01 a 1.2 102.0 NaN 5
5 2019-02-01 a 1.3 102.0 NaN 6
6 2019-02-01 b 2.4 103.0 NaN 7
7 2019-02-01 b 2.4 107.0 NaN 8
# This doesn't work
df[['var1', 'var2']] = df.groupby(['date', 'building'])[['var1', 'var2']].ffill()
ValueError: Columns must be same length as key
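The error above says that the right-hand side did not have exactly the two columns named on the left. A minimal sketch of one way around it, assuming the grouped ffill() may return more than the selected columns (some older pandas releases reportedly also returned the grouping columns, newer ones do not): select the wanted columns from the result explicitly before assigning. The names `cols` and `filled` are illustrative, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01'] * 4 + ['2019-02-01'] * 4,
    'building': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'var1': [1.5, np.nan, 2.1, 2.2, 1.2, 1.3, 2.4, np.nan],
    'var2': [100, 110, 105, np.nan, 102, np.nan, 103, 107],
    'var3': [10, 11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'var4': [1, 2, 3, 4, 5, 6, 7, 8],
})

cols = ['var1', 'var2']  # columns to forward-fill (illustrative name)

# Forward-fill within each (date, building) group.
filled = df.groupby(['date', 'building'])[cols].ffill()

# Inspect what came back; the exact column set may depend on the pandas version.
print(filled.columns.tolist())

# Selecting `cols` from the result guarantees the two sides have the same width.
df[cols] = filled[cols]
print(df)
```

Because only var1 and var2 are assigned, var3 and var4 are left as they were.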
Do it iteratively:
gb = df.groupby(['date', 'building'])
for g in ["var1", "var2"]:
    df[g] = gb[g].ffill()
date building var1 var2 var3 var4
0 2019-01-01 a 1.5 100.0 10.0 1
1 2019-01-01 a 1.5 110.0 11.0 2
2 2019-01-01 b 2.1 105.0 NaN 3
3 2019-01-01 b 2.2 105.0 NaN 4
4 2019-02-01 a 1.2 102.0 NaN 5
5 2019-02-01 a 1.3 102.0 NaN 6
6 2019-02-01 b 2.4 103.0 NaN 7
7 2019-02-01 b 2.4 107.0 NaN 8
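The same per-column loop can be wrapped in a small helper that returns a new DataFrame instead of modifying df in place. This is only a sketch: `ffill_within_groups` is a hypothetical name, not a pandas function, and it assumes the column names are strings so they can be passed to `DataFrame.assign` as keyword arguments.

```python
import numpy as np
import pandas as pd


def ffill_within_groups(frame, keys, cols):
    """Return a copy of `frame` with `cols` forward-filled inside each group of `keys`."""
    grouped = frame.groupby(keys)
    # One grouped ffill per column; every other column is carried over unchanged.
    return frame.assign(**{c: grouped[c].ffill() for c in cols})


df = pd.DataFrame({
    'date': ['2019-01-01'] * 4 + ['2019-02-01'] * 4,
    'building': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'var1': [1.5, np.nan, 2.1, 2.2, 1.2, 1.3, 2.4, np.nan],
    'var2': [100, 110, 105, np.nan, 102, np.nan, 103, 107],
    'var3': [10, 11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'var4': [1, 2, 3, 4, 5, 6, 7, 8],
})

result = ffill_within_groups(df, ['date', 'building'], ['var1', 'var2'])
print(result)
```

Returning a copy keeps the original df available for comparison, at the cost of some extra memory.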
@Gaurav Bansal You are just missing a few columns when assigning the groupby result back to the DataFrame.
df[['date','building','var1','var2']]=df.groupby(['date','building'])[['var1','var2']].ffill()
In the pandas version that raises this error, the groupby returns a four-column DataFrame, namely 'date', 'building', 'var1' and 'var2'. Alternatively, you can store the result in a separate DataFrame.
So you need to assign it to four columns so that the keys match the returned values exactly.

I think you need to add fillna before your groupby:
df[["var1", "var2"]] = df[["var1", "var2"]].fillna(df.groupby(['date', 'building'])[["var1", "var2"]].ffill())
date building var1 var2 var3 var4
0 2019-01-01 a 1.5 100.0 10.0 1
1 2019-01-01 a 1.5 110.0 11.0 2
2 2019-01-01 b 2.1 105.0 NaN 3
3 2019-01-01 b 2.2 105.0 NaN 4
4 2019-02-01 a 1.2 102.0 NaN 5
5 2019-02-01 a 1.3 102.0 NaN 6
6 2019-02-01 b 2.4 103.0 NaN 7
7 2019-02-01 b 2.4 107.0 NaN 8
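The fillna approach works because `DataFrame.fillna` accepts another DataFrame and fills by matching index and column labels, so any column in the filler that is not in df[["var1", "var2"]] is ignored. A minimal sketch that also checks var3 and var4 are untouched; the names `original`, `cols` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01'] * 4 + ['2019-02-01'] * 4,
    'building': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'var1': [1.5, np.nan, 2.1, 2.2, 1.2, 1.3, 2.4, np.nan],
    'var2': [100, 110, 105, np.nan, 102, np.nan, 103, 107],
    'var3': [10, 11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'var4': [1, 2, 3, 4, 5, 6, 7, 8],
})
original = df.copy()  # kept only to compare against afterwards

cols = ['var1', 'var2']
filled = df.groupby(['date', 'building'])[cols].ffill()

# fillna aligns `filled` to df[cols] on row index and column names,
# so only the missing cells of var1 and var2 are replaced.
df[cols] = df[cols].fillna(filled)

# DataFrame.equals treats NaNs in the same positions as equal.
untouched = df[['var3', 'var4']].equals(original[['var3', 'var4']])
print(df)
print(untouched)
```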
The problem here is that only var1 and var2 are kept. I edited my question to include other variables that should not be dropped or modified.