Python: Creating a share variable after a combined groupby in a DataFrame


python python-3.x pandas dataframe


I am having trouble describing my problem, so I will jump right in. Here is some test data:

import pandas as pd
df = pd.DataFrame(data={"family":["Smith","Miller","Simpson","Miller","Simpson","Smith","Miller","Simpson","Miller"],
                    "first_name":["Anna","Bart","Lisa","Ida","Paul","Bridget","Harry","Dustin","George"],
                    "shirt_color":["green","yellow","red","yellow","green","red","yellow","red","red"]})

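Now I want to create a new column in the original DataFrame that contains the share of each shirt_color within each family, so that every row with, for example, family Miller and shirt_color yellow has the same value 0.75, and so on.

I have tried several approaches, but none of them worked: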

df = df.groupby("family").apply(lambda x: x.groupby("shirt_color").apply(lambda x: x.size()/familysize))

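This looked promising, but as you can see, I can no longer access the number of family members inside the inner lambda function. I also tried creating a groupby object on family only and iterating over the resulting DataFrames, grouping each one by color separately, but in the end I could not put the DataFrames back together into one.

This does not seem like a very exotic thing to do with a DataFrame, so I am sure there is a simple way to do it, but I have run out of ideas.

Thank you very much for your help.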

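You are almost there. Just use different variable names. By using x as the parameter of both lambdas, you overwrite the outer variable and can no longer access it: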

df.groupby("family").apply(lambda s: s.groupby("shirt_color").apply(lambda x: x.size/s.size))

family   shirt_color
Miller   red            0.250000
         yellow         0.750000
Simpson  green          0.333333
         red            0.666667
Smith    green          0.500000
         red            0.500000
dtype: float64
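The shadowing issue this answer describes can be seen in isolation, without pandas. The following is only a minimal sketch; the names shadowed and distinct are hypothetical and do not appear in the question.

```python
# Both lambdas name their parameter x, so inside the inner lambda
# only the inner argument is visible; the outer x is hidden.
shadowed = lambda x: (lambda x: x)("inner")

# With distinct parameter names (s and x), the inner lambda can
# still read the outer argument.
distinct = lambda s: (lambda x: (s, x))("inner")

print(shadowed("outer"))  # inner
print(distinct("outer"))  # ('outer', 'inner')
```

The same reasoning applies to the nested groupby call: once the outer lambda's parameter is named s, the inner lambda can refer to the whole family group.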

In my opinion, you should avoid apply, because it results in an inefficient Python-level loop. Here is an alternative solution using GroupBy + transform:

f = df.groupby('family')['first_name'].transform('size')
g = df.groupby(['family', 'shirt_color'])['first_name'].transform('size')

df['ratio'] = g / f

print(df)

    family first_name shirt_color     ratio
0    Smith       Anna       green  0.500000
1   Miller       Bart      yellow  0.750000
2  Simpson       Lisa         red  0.666667
3   Miller        Ida      yellow  0.750000
4  Simpson       Paul       green  0.333333
5    Smith    Bridget         red  0.500000
6   Miller      Harry      yellow  0.750000
7  Simpson     Dustin         red  0.666667
8   Miller     George         red  0.250000
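If the same calculation is needed for other column pairs, the transform idea could be wrapped in a small helper. This is only a sketch under assumptions: the function name add_share and its parameters are hypothetical, and it counts rows by summing a Series of ones instead of selecting a column such as first_name.

```python
import pandas as pd

def add_share(df, outer, inner, name="ratio"):
    """Return a copy of df with the share of each `inner` value
    within its `outer` group (hypothetical helper)."""
    out = df.copy()
    # A Series of ones aligned with the rows; summing it per group
    # yields the group size broadcast back to every row.
    ones = pd.Series(1, index=out.index)
    outer_n = ones.groupby(out[outer]).transform("sum")
    inner_n = ones.groupby([out[outer], out[inner]]).transform("sum")
    out[name] = inner_n / outer_n
    return out

df = pd.DataFrame({
    "family": ["Smith", "Miller", "Simpson", "Miller", "Simpson",
               "Smith", "Miller", "Simpson", "Miller"],
    "first_name": ["Anna", "Bart", "Lisa", "Ida", "Paul",
                   "Bridget", "Harry", "Dustin", "George"],
    "shirt_color": ["green", "yellow", "red", "yellow", "green",
                    "red", "yellow", "red", "red"],
})

result = add_share(df, "family", "shirt_color")
print(result)
```

Because transform returns a result aligned with the original index, the rows keep their original order and no merge step is required.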

Try using value_counts and merge:

s = (df.groupby('family').shirt_color
        .value_counts(normalize=True).rename('ratio').reset_index())

To put it back into the initial DataFrame:

df.merge(s)

I never knew about normalize=True. +1.

    family shirt_color     ratio
0   Miller      yellow  0.750000
1   Miller         red  0.250000
2  Simpson         red  0.666667
3  Simpson       green  0.333333
4    Smith       green  0.500000
5    Smith         red  0.500000

df.merge(s)

    family first_name shirt_color     ratio
0    Smith       Anna       green  0.500000
1   Miller       Bart      yellow  0.750000
2   Miller        Ida      yellow  0.750000
3   Miller      Harry      yellow  0.750000
4  Simpson       Lisa         red  0.666667
5  Simpson     Dustin         red  0.666667
6  Simpson       Paul       green  0.333333
7    Smith    Bridget         red  0.500000
8   Miller     George         red  0.250000
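In the merged output above, the rows no longer appear in the order of the original DataFrame. If that order matters, one option is a left merge with explicit key columns. The following is a sketch, not part of the original answer; the variable name merged is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "family": ["Smith", "Miller", "Simpson", "Miller", "Simpson",
               "Smith", "Miller", "Simpson", "Miller"],
    "first_name": ["Anna", "Bart", "Lisa", "Ida", "Paul",
                   "Bridget", "Harry", "Dustin", "George"],
    "shirt_color": ["green", "yellow", "red", "yellow", "green",
                    "red", "yellow", "red", "red"],
})

# Share of each shirt_color within each family, as a flat lookup table.
s = (df.groupby("family")["shirt_color"]
       .value_counts(normalize=True)
       .rename("ratio")
       .reset_index())

# how="left" keeps the rows of df in their original order;
# on=... states the join keys explicitly instead of relying on
# the columns the two frames happen to share.
merged = df.merge(s, on=["family", "shirt_color"], how="left")
print(merged)
```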