Python: groupby with a minimum group size

python python-2.7 pandas

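I have a DataFrame df of shape (450000, 15) containing user information. Each row is a different user, with 13 features (age, gender, hometown, ...) and 1 boolean variable indicating whether the user owns a car.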

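I want to group my users to find out which groups own the most cars, but I need to keep at least 2500 users in each group so that the result stays statistically meaningful.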

test= df.groupby(['Gender'])
test.size() # check the groups size

So far so good: every group has more than 2500 users. So I add another grouping criterion:

test2= df.groupby(['Gender','Age'])  
test2.size()

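Gender  Age
Female

groupby has to group on a "key" that can be computed for each row independently. In other words, there is no way to group by a criterion that depends on an aggregate property you only know after the groups have been created, such as their size. You can write code that tries different groupings and uses some heuristic to decide which one is "best", but there is no built-in feature for this.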
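One possible heuristic of the kind described above, as a minimal sketch rather than a built-in pandas feature: split on the keys one at a time, and only accept a split if every resulting subgroup still has at least the minimum size. The function name split_while_large and the toy data are assumptions made for illustration; the result depends on the order of the keys.

```python
import pandas as pd

def split_while_large(frame, keys, minimum_size, label=()):
    """Return {group label: size}.

    Hypothetical helper: splits `frame` on keys[0] only if every child
    group keeps at least `minimum_size` rows, then recurses on keys[1:].
    """
    if keys:
        children = frame.groupby(keys[0], observed=True)
        if (children.size() >= minimum_size).all():
            result = {}
            for value, child in children:
                result.update(
                    split_while_large(child, keys[1:], minimum_size, label + (value,))
                )
            return result
    # Either no keys are left or the split would create a group that is too small
    return {", ".join(map(str, label)) or "all": len(frame)}

# Toy data with the same shape as the question, scaled down (minimum size 3)
rows = (
    [("Female", "<20")] * 2 + [("Female", "20-90")] * 4
    + [("Male", "<20")] * 3 + [("Male", "20-90")] * 5 + [("Male", "90+")] * 3
)
demo = pd.DataFrame(rows, columns=["Gender", "Age"])

sizes = split_while_large(demo, ["Gender", "Age"], minimum_size=3)
print(pd.Series(sizes, name="group_size"))
```

Here the Female group is left unsplit because one of its age bands would have only 2 rows, while the Male group is split by age because all three bands have at least 3 rows.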
Do you want every group to have at least 2500 users?
You could do something like this:
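import pandas as pd

minimum_size = 2500
# List of all sets of columns you want to try grouping by
group_ids_list = [['Gender'], ['Age'], ['Gender', 'Age']]
# Will be filled with the groups that pass the size test
valid_groups = []
group_sizes = {}

for group_ids in group_ids_list:
    grouped_df = df.groupby(group_ids)
    for key, group in grouped_df:
        # Keep only groups with at least minimum_size users
        if len(group) >= minimum_size:
            valid_groups.append(group)
            # Normalise the key to a tuple so all groupings share one label format
            key = key if isinstance(key, tuple) else (key,)
            group_sizes[', '.join(str(k) for k in key)] = len(group)

group_sizes = pd.Series(group_sizes)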

After that, you can work with only the valid groups.
I hope the pseudocode helps; otherwise, please provide a reproducible example.
I think FLab's answer is probably more complete and more correct. But if you want a quick fix:
column = 'Gender'
minimum_size = 2500

# Count rows per category once, then keep the categories that are large enough
counts = df[column].value_counts()
valid_groups = counts[counts >= minimum_size].index
mask = df[column].isin(valid_groups)
df[mask].groupby(column)
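The same filtering idea can be extended to several grouping columns at once. The sketch below, on made-up toy data with a hypothetical threshold of 2, uses two standard pandas calls: transform("size") to attach each row's group size, and groupby(...).filter(...) as an equivalent but usually slower alternative.

```python
import pandas as pd

# Toy data (an assumption for illustration, not the question's real data)
demo = pd.DataFrame({
    "Gender": ["Female"] * 2 + ["Male"] * 5,
    "Age": ["<20", "20-90", "<20", "<20", "<20", "20-90", "20-90"],
})
keys = ["Gender", "Age"]
minimum_size = 2

# Size of the group each row belongs to, aligned with the original index
group_size = demo.groupby(keys)["Gender"].transform("size")
large_enough = demo[group_size >= minimum_size]

# Equivalent result with filter(), which calls the lambda once per group
filtered = demo.groupby(keys).filter(lambda g: len(g) >= minimum_size)

remaining_sizes = large_enough.groupby(keys).size().to_dict()
print(remaining_sizes)
```

Rows from groups below the threshold are dropped entirely, so this answers "which groups are big enough" but does not merge small groups back into a coarser one.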

Gender   
Female   150 000 # Don't split here because groups will be too small

# Here I can split, because group size > 2500 :
Gender   Age
Male     <20     5040 
         20-90   291930
         90+     3030    
dtype: int64

name_of_group  group_size
Female         150000
Male, <20      5040
Male, 20-90    291930
Male, 90+      3030
