Python: handling sparse categories in pandas - replace everything not in the top categories with "Other"


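When cleaning data, I often run into the following problem: a column has a few common categories (say, the top 10 movie genres) and many other categories that are rare. The usual practice is to combine the sparse genres into a single "Other" genre.

When there are only a few sparse categories, this is easy to do: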

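# Merge the sparse bungalow classes into a single 'Bungalow' class.
# Assigning the result back avoids relying on inplace=True on a column accessed as an attribute.
df['property_type'] = df['property_type'].replace(['Terraced bungalow', 'Detached bungalow', 'Semi-detached bungalow'], 'Bungalow')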

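But if, for example, I have a movie dataset in which most movies were made by 8 big studios, and I want to merge everything else into an "Other" studio, it makes sense to get the top 8 studios first: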

top_8_list = []
top_8 = df.studio.value_counts().head(8)
for key, value in top_8.items():  # Series.iteritems() was removed in pandas 2.0
    top_8_list.append(key)

top_8_list
['Universal Pictures',
 'Warner Bros.',
 'Paramount Pictures',
 'Twentieth Century Fox Film Corporation',
 'New Line Cinema',
 'Columbia Pictures Corporation',
 'Touchstone Pictures',
 'Columbia Pictures']

and then do something like:

replace studio with "Other" where studio is not in the top-8 list

So the question is: does anyone know an elegant way to do this in pandas? It is a very common data-cleaning task.

You can convert the column to the Categorical type, which has the added benefit of lower memory use:

top_cats = df.studio.value_counts().head(8).index.tolist() + ['other']
df['studio'] = pd.Categorical(df['studio'], categories=top_cats).fillna('other')
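As a rough illustration of the Categorical approach above, here is a minimal sketch on invented toy data (the studio names "Rare Studio A" and "Rare Studio B", the row counts, and the choice of the top 2 instead of the top 8 are all assumptions made for the example). It also compares memory use before and after; the exact byte counts will vary with the pandas version and the data.

```python
import pandas as pd

# Hypothetical toy data: 6000 rows, two frequent studios and two rare ones
studios = (
    ["Universal Pictures"] * 3000
    + ["Warner Bros."] * 2000
    + ["Rare Studio A"] * 600
    + ["Rare Studio B"] * 400
)
df = pd.DataFrame({"studio": studios})

bytes_before = df["studio"].memory_usage(deep=True)

# Keep the 2 most frequent studios (the answer uses 8) plus an 'other' bucket
top_cats = df["studio"].value_counts().head(2).index.tolist() + ["other"]

# Values outside `categories` become NaN, which fillna then maps to 'other'
df["studio"] = pd.Categorical(df["studio"], categories=top_cats).fillna("other")

bytes_after = df["studio"].memory_usage(deep=True)
counts = df["studio"].value_counts().to_dict()

print(counts)
print(bytes_before, bytes_after)
```

Note that with this approach any values that were already missing (NaN) in the column also end up in 'other'.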

You can use df.loc with boolean indexing:

df.loc[~df['studio'].isin(top_8_list), 'studio'] = 'Other'

Note: there is no need to build the list of the top 8 studios with a manual for loop:

top_8_list = df['studio'].value_counts().index[:8]
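Putting the two lines of this answer together, a self-contained sketch might look like the following. The single-letter studio names and the use of the top 2 rather than the top 8 are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: 'A' and 'B' are frequent, 'C' and 'D' are rare
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

# value_counts() sorts by frequency, so the first entries of its index
# are the most common studios; isin() accepts the Index directly
top_2 = df["studio"].value_counts().index[:2]

# Overwrite every studio that is NOT among the top ones
df.loc[~df["studio"].isin(top_2), "studio"] = "Other"

counts = df["studio"].value_counts().to_dict()
print(counts)
```

The ~ in front of isin is what makes the mask select the rare studios; without it, the top studios themselves would be overwritten.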

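This actually replaces the top 8 with other.

@DanielLennart Oops, you're quite right... it's been a long week! I've updated my answer. Sorry about that.

I haven't seen this approach using Categorical before; I'll give it a try.

This approach works well too, thanks. In fact, I can see memory drop from 0.749359130859375 MB to 0.709827423957031 MB after converting to category, for a dataset of 6000 observations.

To really get the most out of pd.Categorical, I think the series should be converted to categorical before calling value_counts.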
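One possible reading of that last suggestion, sketched on invented toy data (the studio names and the top-2 cutoff are assumptions, and this is not necessarily what the commenter had in mind): convert the column to the category dtype first, then count and relabel while staying categorical.

```python
import pandas as pd

# Hypothetical toy data: 'A' and 'B' are frequent, 'C' and 'D' are rare
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

# Convert to the category dtype before counting
s = df["studio"].astype("category")
top_2 = s.value_counts().index[:2].tolist()

# A categorical only accepts values that are declared categories,
# so 'other' has to be added before it can be assigned
s = s.cat.add_categories("other")
s.loc[~s.isin(top_2)] = "other"

# Drop the rare categories that no longer occur in the data
s = s.cat.remove_unused_categories()
df["studio"] = s

categories = list(df["studio"].cat.categories)
print(categories)
print(df["studio"].tolist())
```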