Python: creating a share variable after a combined groupby in a DataFrame
I'm having trouble describing my problem, so I'll get right to it. Here is some test data:
import pandas as pd
df = pd.DataFrame(data={"family":["Smith","Miller","Simpson","Miller","Simpson","Smith","Miller","Simpson","Miller"],
"first_name":["Anna","Bart","Lisa","Ida","Paul","Bridget","Harry","Dustin","George"],
"shirt_color":["green","yellow","red","yellow","green","red","yellow","red","red"]})
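Now I want to create a new column in the original dataframe that contains the share of each shirt_color within each family, so that every row with, for example, family Miller and shirt_color yellow has the same value 0.75, and so on.
I have tried several approaches, but none of them worked: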
df = df.groupby("family").apply(lambda x: x.groupby("shirt_color").apply(lambda x: x.size()/familysize))
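This looked promising, but as you can see, I can no longer access the number of family members inside the last lambda function. I also tried creating a groupby object on family only and iterating over the dataframes, grouping each one by color separately, but in the end I couldn't put the dataframes back together into one.
This doesn't seem like a very exotic thing to do with a dataframe, so I'm sure there is a simple way to do it, but I'm out of ideas.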
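Any help is greatly appreciated.

You're almost there. Just use different variable names. By using x in both lambdas, the inner x shadows the outer one, so you can no longer access the outer group: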
df.groupby("family").apply(lambda s: s.groupby("shirt_color").apply(lambda x: x.size/s.size))
family   shirt_color
Miller   red            0.250000
         yellow         0.750000
Simpson  green          0.333333
         red            0.666667
Smith    green          0.500000
         red            0.500000
dtype: float64
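The result above is a Series indexed by (family, shirt_color), not yet a column on the original rows. As a minimal sketch of one way to attach it, the shares can be built from two size() calls and joined back on the two key columns. This swaps the nested apply for plain groupby counts; the names pair_counts, family_counts, shares and result are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame(data={
    "family": ["Smith", "Miller", "Simpson", "Miller", "Simpson",
               "Smith", "Miller", "Simpson", "Miller"],
    "first_name": ["Anna", "Bart", "Lisa", "Ida", "Paul",
                   "Bridget", "Harry", "Dustin", "George"],
    "shirt_color": ["green", "yellow", "red", "yellow", "green",
                    "red", "yellow", "red", "red"],
})

# Rows per (family, shirt_color) pair and rows per family.
pair_counts = df.groupby(["family", "shirt_color"]).size()
family_counts = df.groupby("family").size()

# Divide each pair count by its family's count, matching on the "family" level.
shares = pair_counts.div(family_counts, level="family").rename("ratio")

# Look up each row's (family, shirt_color) pair in the MultiIndex of `shares`.
result = df.join(shares, on=["family", "shirt_color"])
print(result)
```

Because join performs a left join by default, the rows of result stay in the same order as in df.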
In my opinion, apply should be avoided here because it results in an inefficient Python-level loop. Here is an alternative solution using GroupBy + transform:
f = df.groupby('family')['first_name'].transform('size')
g = df.groupby(['family', 'shirt_color'])['first_name'].transform('size')
df['ratio'] = g / f
print(df)
    family first_name shirt_color     ratio
0    Smith       Anna       green  0.500000
1   Miller       Bart      yellow  0.750000
2  Simpson       Lisa         red  0.666667
3   Miller        Ida      yellow  0.750000
4  Simpson       Paul       green  0.333333
5    Smith    Bridget         red  0.500000
6   Miller      Harry      yellow  0.750000
7  Simpson     Dustin         red  0.666667
8   Miller     George         red  0.250000
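A quick way to sanity-check a ratio column like this one is to confirm that the distinct color shares within each family add up to 1. The sketch below repeats the transform approach and adds that check; the name share_sums is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(data={
    "family": ["Smith", "Miller", "Simpson", "Miller", "Simpson",
               "Smith", "Miller", "Simpson", "Miller"],
    "first_name": ["Anna", "Bart", "Lisa", "Ida", "Paul",
                   "Bridget", "Harry", "Dustin", "George"],
    "shirt_color": ["green", "yellow", "red", "yellow", "green",
                    "red", "yellow", "red", "red"],
})

# Family size and (family, color) size, each broadcast to every row.
f = df.groupby("family")["first_name"].transform("size")
g = df.groupby(["family", "shirt_color"])["first_name"].transform("size")
df["ratio"] = g / f

# Keep one row per (family, shirt_color) pair, then sum the shares per family.
share_sums = (df.drop_duplicates(["family", "shirt_color"])
                .groupby("family")["ratio"].sum())
print(share_sums)
```

Each family's sum should come out as 1, up to floating-point rounding.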
Try using value_counts and merge:
s = (df.groupby('family').shirt_color
.value_counts(normalize=True).rename('ratio').reset_index())
    family shirt_color     ratio
0   Miller      yellow  0.750000
1   Miller         red  0.250000
2  Simpson         red  0.666667
3  Simpson       green  0.333333
4    Smith       green  0.500000
5    Smith         red  0.500000
To put it back into the initial dataframe:
df.merge(s)
    family first_name shirt_color     ratio
0    Smith       Anna       green  0.500000
1   Miller       Bart      yellow  0.750000
2   Miller        Ida      yellow  0.750000
3   Miller      Harry      yellow  0.750000
4  Simpson       Lisa         red  0.666667
5  Simpson     Dustin         red  0.666667
6  Simpson       Paul       green  0.333333
7    Smith    Bridget         red  0.500000
8   Miller     George         red  0.250000
I never knew about normalize=True. +1.
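Called without arguments, df.merge(s) joins on every column the two frames share and performs an inner join, and in the output shown the rows are no longer in their original order. As a hedged variant, the sketch below names the key columns explicitly and uses how="left", which keeps the rows in the order of the left frame. The names shares and merged are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={
    "family": ["Smith", "Miller", "Simpson", "Miller", "Simpson",
               "Smith", "Miller", "Simpson", "Miller"],
    "first_name": ["Anna", "Bart", "Lisa", "Ida", "Paul",
                   "Bridget", "Harry", "Dustin", "George"],
    "shirt_color": ["green", "yellow", "red", "yellow", "green",
                    "red", "yellow", "red", "red"],
})

# Share of each shirt color within each family, as a flat lookup table.
shares = (df.groupby("family")["shirt_color"]
            .value_counts(normalize=True)
            .rename("ratio")
            .reset_index())

# Explicit keys plus a left join: one output row per row of df, in df's order.
merged = df.merge(shares, on=["family", "shirt_color"], how="left")
print(merged)
```

Spelling out the keys also guards against an accidental join on an unrelated column that happens to have the same name in both frames.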