How can pandas use the list of categories an example belongs to as a feature in a classification problem?

Tags: pandas, machine-learning, scikit-learn

One of the features looks like this:
1 170,169,205,174,173,246,247,249,380,377,383,38...
2 448,104,239,277,276,99,154,155,76,412,139,333,...
3 268,422,419,124,1,17,431,343,341,435,130,331,5...
4 50,53,449,106,279,420,161,74,123,364,231,18,23...
5 170,169,205,174,173,246,247,249,380,377,383,38...
It tells us which categories the example belongs to.
How should I use it when solving a classification problem?
I tried using dummy variables:
df=df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))
But categories that are never mentioned in the training set get no column, so I don't know how to preprocess all the objects consistently.

That's an interesting one. I don't know how to do everything, but I can help you with the rest. You basically have two problems:

Problem 1: cleansing the data so it contains only valid categories
# create a set of all categories you want to allow
# either define it as a fixed set, or extract it from your
# column like this (the output of the map is actually irrelevant)
# the result will be in valid_categories
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# now, if you want to normalize your data before you do the
# dummy encoding, you can cleanse the data by
# splitting it, intersecting it with the valid set and then
# joining it back again to get a string you can feed to
# str.get_dummies
df['categories']= df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',')
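Run end to end on toy data, the cleansing step looks like this (the category values below are made up for illustration; I join with `','.join` over a sorted list so the output is deterministic, since sets have no fixed order):

```python
import pandas as pd

# toy frame; category values are hypothetical
df = pd.DataFrame({'categories': ['170,169', '448,104', '170,999']})

# collect every category that occurs in the column
valid_categories = set()
df['categories'].str.split(',').map(valid_categories.update)

# pretend '999' is not a valid category
valid_categories.discard('999')

# keep only valid categories and join back into a string
cleaned = (df['categories']
           .str.split(',')
           .map(lambda l: ','.join(sorted(valid_categories.intersection(l)))))
```

After this, `cleaned` holds '169,170', '104,448' and '170' — the invalid '999' is gone.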
Problem 2: generating dummies for all known categories

The second problem can be solved by adding a dummy row that contains all categories (e.g. with df.loc) before you call get_dummies, and dropping it again right after get_dummies:
# e.g. you can do it like this
# get a new index value to
# be able to remove the row later
# (this only works if you have
# a numeric index)
dummy_index= df.index.max()+1
# assign all categories to the dummy row
df.loc[dummy_index]= {'id':999, 'categories': ','.join(valid_categories)}
# now do the processing steps
# mentioned in the section above
# then create the dummies
# after that remove the dummy line
# again
df.drop(labels=[dummy_index], inplace=True)
For example:
import pandas as pd
import io
raw= """id categories
1 170,169,205,174,173,246,247
2 448,104,239,277,276,99,154
3 268,422,419,124,1,17,431,343
4 50,53,449,106,279,420,161,74
5 170,169,205,174,173,246,247"""
df= pd.read_fwf(io.StringIO(raw))
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# remove 154 and 170 for demonstration purposes
valid_categories.remove('170')
valid_categories.remove('154')
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
Out[622]:
1 104 106 124 161 169 17 173 174 205 239 246 247 268 276 277 279 343 419 420 422 431 448 449 50 53 74 99
0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1
2 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0
4 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
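If you want to solve both problems in one step, scikit-learn's MultiLabelBinarizer (the question is tagged scikit-learn) accepts a fixed classes= list, so the column layout never depends on the data; labels outside that list are dropped with a UserWarning. A minimal sketch with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'categories': ['170,169', '448,104', '170,999']})

# the fixed, known set of categories (hypothetical values)
classes = ['104', '154', '169', '170', '448']

mlb = MultiLabelBinarizer(classes=classes)
# '999' is not in classes, so it is ignored (with a UserWarning)
dummies = pd.DataFrame(mlb.fit_transform(df['categories'].str.split(',')),
                       columns=mlb.classes_,
                       index=df.index)
```

Because the column layout comes from classes= and not from the data, the '154' column exists even though no row contains it, which is exactly what a trained model needs.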
You can see that there are no columns for 154 and 170.

Comment: What are you trying to tell us?

@PV8: This is the old problem of ML and dummy columns. If you train a model on a specific column layout, you need to apply the same column layout later, in the production phase. Your model does not expect to see columns that did not exist during training, and likewise it does not expect columns to disappear just because the data that generated them is no longer present.
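For the production phase you don't even need the dummy-row trick at prediction time: pandas' reindex can force whatever column layout the model was trained on, adding missing columns as zeros and dropping unknown ones. A hedged sketch (column names made up):

```python
import pandas as pd

# column layout the model was trained with (hypothetical)
train_columns = ['104', '154', '169', '170']

# dummies produced for new data: '999' was never seen in
# training, and '154' happens to be absent here
new = pd.DataFrame({'170': [1], '999': [1]})

# enforce the training layout: unknown columns are dropped,
# missing ones are added and filled with 0
aligned = new.reindex(columns=train_columns, fill_value=0)
```

`aligned` now has exactly the training columns, so it can be passed straight to the fitted model.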