Machine learning: how do I deal with the large number of features produced by applying one-hot encoding to multiple columns of a dataset?


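I am using the TMDB 5000 Movie Dataset from Kaggle:

https://www.kaggle.com/tmdb/tmdb-movie-metadata

In the preprocessing stage, I use MultiLabelBinarizer() to encode columns of the dataset such as:

 - Genres, production_countries, production_companies, Cast

Now I have a very large number of features. How can I deal with this?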

from sklearn.preprocessing import MultiLabelBinarizer

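Before applying one-hot encoding, inspect your nominal features and keep only the values that occur frequently enough. You can replace all the remaining values with the string "Other". For example, if you want to keep only the 100 most frequent values: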

val_freq = df[your_column].value_counts()  # counts each value, sorted by descending frequency
good_vals = val_freq[:100].index  # the 100 most frequent values
# replace every value outside the top 100 with "Other"; .loc avoids chained assignment
df.loc[~df[your_column].isin(good_vals), your_column] = 'Other'
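
Since the question uses MultiLabelBinarizer, each cell holds several labels rather than one, so the same idea needs a small adjustment. Below is a minimal sketch, assuming each cell has already been parsed into a Python list of label strings. The helper name `keep_top_labels`, the toy `genres` data, and `top_n=2` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def keep_top_labels(series, top_n, other="Other"):
    """Hypothetical helper: in a column of label lists, keep the top_n most
    frequent labels and replace every other label with `other`."""
    # explode() puts one label per row so value_counts() can count across all lists
    counts = series.explode().value_counts()
    keep = set(counts.index[:top_n])
    # the set comprehension collapses repeated "Other" entries within one row
    return series.apply(
        lambda labels: sorted({lab if lab in keep else other for lab in labels})
    )


# toy data standing in for a parsed multi-label column (assumption)
df = pd.DataFrame({
    "genres": [["Action", "Drama"], ["Drama"], ["Action", "Western"], ["Drama", "Noir"]],
})

df["genres_top"] = keep_top_labels(df["genres"], top_n=2)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["genres_top"]),
    columns=mlb.classes_,
    index=df.index,
)
```

With this approach the number of output columns is at most `top_n + 1` per encoded feature, regardless of how many distinct labels the raw column contains.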

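In general, OHE is not a recommended approach for categorical features with high cardinality. You can, for example, use frequency encoding (optionally keeping only the most frequent labels, as suggested above), binary encoding, or numeric encoding. All of these are good solutions that do not create an unnecessarily large number of columns in your data.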
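
As an illustration of frequency encoding, here is a minimal sketch for a single-valued categorical column. The column name `company` and the toy values are assumptions made for the example; the point is only that the feature stays one numeric column instead of one column per category.

```python
import pandas as pd

# toy data standing in for a high-cardinality categorical column (assumption)
df = pd.DataFrame({"company": ["A", "B", "A", "C", "A", "B"]})

# relative frequency of each category in the column
freq = df["company"].value_counts(normalize=True)

# map every row to the frequency of its category: one numeric column, no blow-up
df["company_freq"] = df["company"].map(freq)
```

In a real train/test setup you would probably compute `freq` on the training split only. Categories that never appeared there map to NaN, which can then be filled with 0, for example.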