Python: counting duplicate row entries in a RAM-efficient way


python pandas


I have a large DataFrame and want to get the number of occurrences of each distinct row. I have been using this:

df.groupby(df.columns.tolist(), as_index=False, sort=False).size()

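But it needs more than 60 GB of RAM, and I only have 32 GB available.

Then I came up with the following, but it is extremely slow and churned for more than 5 hours. The DataFrame has 3 columns: two categorical and one string.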

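import pandas as pd
from collections import Counter

# df is the asker's DataFrame: two categorical columns plus one string column.
# Build one Counter of job titles per (industry, sector) group.
counts = df.groupby(['industry', 'sector'], as_index=False, sort=False).aggregate(Counter)

final_df = pd.DataFrame()
for _, row in counts.iterrows():
    other = row.iloc[:-1].to_dict()  # the grouping columns
    for job, n in row.iloc[-1].items():  # the Counter stored in the last column
        tmp_df = pd.DataFrame({
            **other,
            'job.jobTitlText': job,
            'size': n,
        }, index=[0])
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        final_df = pd.concat([final_df, tmp_df])
final_df = final_df.reset_index(drop=True)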

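One of the columns is a categorical (enum-like) column.

It turns out that with observed=False (the default at the time), pandas generates a group for every combination of categories, even for combinations that never occur in the data.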
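As a rough illustration of that effect, here is a minimal sketch on made-up data. The column values and category lists are hypothetical; observed is passed explicitly so the result does not depend on which default a given pandas version uses.

```python
import pandas as pd

# Hypothetical toy data: the category lists are wider than what the rows use.
df = pd.DataFrame({
    "industry": pd.Categorical(
        ["tech", "tech", "retail"],
        categories=["tech", "retail", "energy", "health"],
    ),
    "sector": pd.Categorical(
        ["software", "software", "grocery"],
        categories=["software", "grocery", "oil"],
    ),
})

# observed=False: one group per combination of categories (4 * 3)
all_groups = df.groupby(["industry", "sector"], observed=False).size()

# observed=True: only the combinations that actually appear in the rows
seen_groups = df.groupby(["industry", "sector"], observed=True).size()

print(len(all_groups))   # 12
print(len(seen_groups))  # 2
```

With many categories per column, the product of the category counts can be far larger than the number of rows, which would be consistent with the memory blow-up described in the question.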

The solution is to pass observed=True, so:

counts = df.groupby(df.columns.tolist(), as_index=False, sort=False, observed=True).size()
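To make the shape of the result concrete, here is a small sketch of the same call on a frame laid out like the one in the question (two categorical columns and one string column). The data values are invented for illustration, and the sketch assumes a pandas version in which size() with as_index=False returns a DataFrame with a 'size' column (1.1 or later, as far as I know).

```python
import pandas as pd

# Invented sample rows; only the column layout mirrors the question.
df = pd.DataFrame({
    "industry": pd.Categorical(["tech", "tech", "retail", "tech"]),
    "sector": pd.Categorical(["software", "software", "grocery", "hardware"]),
    "job.jobTitlText": ["engineer", "engineer", "clerk", "engineer"],
})

counts = df.groupby(
    df.columns.tolist(), as_index=False, sort=False, observed=True
).size()

# One output row per distinct input row, with its count in the 'size' column
print(counts)
print(counts["size"].sum() == len(df))
```

A quick sanity check is that the 'size' column sums to the number of rows in the original frame.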

How about print(df.value_counts())?

Nice find. Note that you have categorical data in your question.
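For reference, a minimal sketch of the value_counts suggestion from the comment, using plain string columns and invented data. It assumes pandas 1.1 or later, where DataFrame.value_counts is available. How it treats unobserved categories may differ between versions, so with categorical columns it is worth checking the output size on a small sample first.

```python
import pandas as pd

# Invented sample rows with plain string (object) columns, no categoricals.
df = pd.DataFrame({
    "industry": ["tech", "tech", "retail", "tech"],
    "sector": ["software", "software", "grocery", "hardware"],
    "job.jobTitlText": ["engineer", "engineer", "clerk", "engineer"],
})

# Counts of each distinct row, returned as a Series sorted by count
row_counts = df.value_counts()

# The groupby formulation from the question, for comparison
via_groupby = df.groupby(df.columns.tolist(), sort=False).size()

print(row_counts)
print(row_counts.to_dict() == via_groupby.to_dict())
```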