Python: get summary data columns in a new dataframe from an existing dataframe, based on another column's ID


python pandas


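I want to summarize data from one dataframe and add the result as new columns to another dataframe. My data contains apartments, each with an ID number, and holds surface and volume values for every room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. The original dataframe comes with two conditions: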
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within that ID 
(so all the same values for the same ID), then the data (surface, volumes) is not 
summed but one value/row is passed to the new summary column (example: 'ID 4')(as 
this could be a mistake in the original dataframe and the total surface/volume was 
inserted for all the rooms by the government-employee)

Initial dataframe "data":

print(data)

    ID  Surface  Volume
0    2     10.0    25.0
1    2     12.0    30.0
2    2     24.0    60.0
3    2      8.0    20.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52      NaN     NaN
8   52     96.0   240.0
9   95      8.0    20.0
10  95      6.0    15.0
11  95     12.0    30.0
12  95     30.0    75.0
13  95     12.0    30.0

import pandas as pd 
import numpy as np  

df = pd.DataFrame({"ID": [2,4,52,95]})  

data = pd.DataFrame({"ID":  [2,2,2,2,4,4,4,52,52,95,95,95,95,95],                 
                "Surface":  [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],                  
                 "Volume":  [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})  
print(data)  


#Tried something, but no idea how to do this actually:

df["Surface"] = data.groupby("ID").agg(sum) 
df["Volume"] = data.groupby("ID").agg(sum)
print(df)

Desired output from "df":

print(df)
    ID  Surface  Volume
0    2     54.0   135.0
1    4     84.0   200.0  #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2   52     96.0   240.0
3   95     68.0   170.0


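Two conditions are needed here. The first tests the number of unique values per group, for each column separately, with transform and nunique, and compares the result to 1 with eq. The second uses duplicated on each column together with the ID column. Then chain both masks with & (bitwise AND), replace the matched values with NaNs using mask, and finally aggregate with sum: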

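cols = ['Surface','Volume']
# m1: True where all non-missing values of the column are identical within the ID
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True for every repeat of a (value, ID) pair after its first occurrence
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

# blank out the repeated identical values, then sum per ID (NaNs are skipped)
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()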
print(df)
   ID  Surface  Volume
0   2     54.0   135.0
1   4     84.0   200.0
2  52     96.0   240.0
3  95     68.0   170.0
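For comparison, the same rule can also be written as a per-group function, which some readers may find easier to follow than the two masks. This is only a sketch and not the approach used in the answer above: the helper name total_or_single is made up for illustration, and it assumes that "all values equal" should ignore missing cells.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "Volume": [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30],
})


def total_or_single(values: pd.Series) -> float:
    """Hypothetical helper: sum a group unless all non-missing values are identical."""
    present = values.dropna()
    if present.nunique() == 1:
        # Every room reports the same number: treat it as the apartment total.
        return float(present.iloc[0])
    return float(present.sum())


# agg calls the helper once per ID and per column
summary = data.groupby("ID")[["Surface", "Volume"]].agg(total_or_single).reset_index()
print(summary)
```

On the sample data this gives the same four rows as the mask-based version. A Python-level function per group is usually slower than the vectorized masks on large frames, so it is better suited to small data or to checking the logic.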

If you need to fill new columns with the aggregated sum values, use:
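One way this could look, as a sketch rather than the answerer's exact code: keep the same two masks, but end with transform('sum') instead of sum(), so that every row of data receives the total of its apartment. The "_total" suffix on the new column names is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "Volume": [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30],
})

cols = ["Surface", "Volume"]
# Same masks as in the aggregation above
m1 = data.groupby("ID")[cols].transform("nunique").eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data["ID"]).duplicated())

# transform('sum') keeps the original row index, so each row gets its group total
totals = data[cols].mask(m1 & m2).groupby(data["ID"]).transform("sum")

# Attach the totals next to the per-room values (column names are illustrative)
result = data.join(totals.add_suffix("_total"))
print(result)
```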

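This operation is exactly what I meant (only the index is missing in the printed dataframe). Can I pass the new data from dataframe "data" straight into the existing dataframe "df", or do I need something afterwards, like -> df['Surface'] = data['Surface']? Do you also have ideas/suggestions for the follow-up question that builds on this one (about a summarized weighted-average column)?

@Matthi9000 - The answer there looks good, use it ;)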
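On the question of filling the existing "df": plain column assignment such as df['Surface'] = ... aligns on the row index, not on the ID values, so it would not match apartments correctly here. A minimal sketch of one alternative, assuming "df" already holds the ID column as in the question, is to build the summary first and then merge it on ID. The name summary is introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [2, 4, 52, 95]})

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "Volume": [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30],
})

cols = ["Surface", "Volume"]
m1 = data.groupby("ID")[cols].transform("nunique").eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data["ID"]).duplicated())

# One row per apartment, with ID as a regular column
summary = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()

# A left merge keeps every ID already in df, even one that has no rows in data
df = df.merge(summary, on="ID", how="left")
print(df)
```

With how="left", an ID that appears in "df" but not in "data" would get NaN in both summary columns instead of being dropped.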