Pandas: how can I use the list of categories an example belongs to as a feature when solving a classification problem?


One of the features looks like this:

1       170,169,205,174,173,246,247,249,380,377,383,38...
2       448,104,239,277,276,99,154,155,76,412,139,333,...
3       268,422,419,124,1,17,431,343,341,435,130,331,5...
4       50,53,449,106,279,420,161,74,123,364,231,18,23...
5       170,169,205,174,173,246,247,249,380,377,383,38...
It tells us which categories the example belongs to. How should I use it when solving a classification problem?

I tried using dummy variables:

df=df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))

But we know nothing about categories that were not mentioned in the training set, so I don't know how to preprocess all the objects.

This is interesting. I don't know how to do all of it, but I can help you with the rest.

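You basically have two problems:

  • The category sets you receive later contain categories that were unknown when the model was trained. You have to get rid of those.

  • The category sets you receive later do not contain all categories. You have to make sure that dummy columns are generated for the missing ones as well.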

Problem 1: filtering out unknown/unwanted categories

The first problem is easy to solve:

    # create a set of all categories you want to allow;
    # either define it as a fixed set, or extract it from your
    # column like this (the result ends up in valid_categories)
    valid_categories = set()
    for cats in df['categories'].str.split(','):
        valid_categories.update(cats)

    # now, if you want to normalize your data before you do the
    # dummy encoding, you can cleanse the data by splitting it,
    # intersecting it with the valid categories and joining it
    # back into a string that str.get_dummies can work with
    df['categories'] = (
        df['categories'].str.split(',')
        .map(lambda l: valid_categories.intersection(l))
        .str.join(',')
    )
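The snippet above builds valid_categories and cleans the same frame. In practice the set would come from the training data and the cleaning would be applied to data that arrives later. A minimal sketch of that split, assuming two hypothetical frames `train` and `new` with made-up values and a hypothetical helper `keep_known`:

```python
import pandas as pd

# Hypothetical training data and later-arriving data (illustrative values only)
train = pd.DataFrame({"categories": ["170,169,205", "448,104", "268,170"]})
new = pd.DataFrame({"categories": ["170,999", "104,448,777", "555"]})

# Collect every category that was seen during training
valid_categories = set()
for cats in train["categories"].str.split(","):
    valid_categories.update(cats)


def keep_known(cell, allowed):
    """Drop categories that are not in `allowed`, keeping the original order."""
    return ",".join(c for c in cell.split(",") if c in allowed)


# Unknown categories (999, 777, 555) are removed from the new data
new_clean = new["categories"].map(lambda cell: keep_known(cell, valid_categories))
print(new_clean.tolist())  # ['170', '104,448', '']

# The dummy columns of the new data are now a subset of the known categories
new_dummies = new_clean.str.get_dummies(",")
```

Note that a row whose categories are all unknown ends up as an empty string and therefore gets zeros in every dummy column. This sketch also assumes the column has no missing values.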
    
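Problem 2: generating dummies for all known categories

The second problem can be solved by adding a dummy row that contains all categories to df. Add it right before you call get_dummies and remove it again immediately after get_dummies.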

    # e.g. you can do it like this:
    # get a new index value to be able to remove the row later
    # (this only works if you have a numeric index)
    dummy_index = df.index.max() + 1

    # add a helper row that contains every valid category
    df.loc[dummy_index] = {'id': 999, 'categories': ','.join(valid_categories)}
    # now do the processing steps mentioned in the section above,
    # then create the dummies and, after that, remove the dummy
    # row again
    df.drop(labels=[dummy_index], inplace=True)
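Put together, the two steps could be wrapped in one helper. This is only a sketch under assumptions: the function name `dummies_with_fixed_columns` is hypothetical, the column is assumed to contain no missing values, and the helper row is appended to a temporary Series instead of to df itself, so it works for non-numeric indexes too.

```python
import pandas as pd


def dummies_with_fixed_columns(df, column, valid_categories):
    """One-hot encode a comma-separated column so that every category in
    valid_categories gets a column and no other category does."""
    allowed = sorted(valid_categories)
    allowed_set = set(allowed)

    # Step 1: keep only the allowed categories in each cell
    cleaned = df[column].map(
        lambda cell: ",".join(c for c in cell.split(",") if c in allowed_set)
    )

    # Step 2: append one helper row that contains every allowed category
    helper = pd.Series([",".join(allowed)])
    combined = pd.concat([cleaned, helper], ignore_index=True)

    # Create the dummies, then cut the helper row off again
    dummies = combined.str.get_dummies(",").iloc[:-1]
    dummies.index = df.index  # restore the original index
    return dummies


# Hypothetical later-arriving data: 999 is unknown, 448 does not occur
new = pd.DataFrame({"categories": ["170,999", "104"]})
result = dummies_with_fixed_columns(new, "categories", {"104", "170", "448"})
print(result)
```

The result has exactly the columns 104, 170 and 448, regardless of which categories actually occur in `new`.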
    
For example:

    import io
    import pandas as pd

    raw= """id      categories
    1       170,169,205,174,173,246,247
    2       448,104,239,277,276,99,154
    3       268,422,419,124,1,17,431,343
    4       50,53,449,106,279,420,161,74
    5       170,169,205,174,173,246,247"""
    df= pd.read_fwf(io.StringIO(raw))

    # collect all categories that occur in the column
    valid_categories= set()
    for cats in df['categories'].str.split(','):
        valid_categories.update(cats)
    # remove 154 and 170 for demonstration purposes
    valid_categories.remove('170')
    valid_categories.remove('154')

    df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
    Out[622]: 
       1  104  106  124  161  169  17  173  174  205  239  246  247  268  276  277  279  343  419  420  422  431  448  449  50  53  74  99
    0  0    0    0    0    0    1   0    1    1    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0   0   0   0   0
    1  0    1    0    0    0    0   0    0    0    0    1    0    0    0    1    1    0    0    0    0    0    0    1    0   0   0   0   1
    2  1    0    0    1    0    0   1    0    0    0    0    0    0    1    0    0    0    1    1    0    1    1    0    0   0   0   0   0
    3  0    0    1    0    1    0   0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    0    0    1   1   1   1   0
    4  0    0    0    0    0    1   0    1    1    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0   0   0   0   0
    

As you can see, there are no columns for 154 and 170.

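What are you trying to tell us?

@PV8: This is the old problem with ML and dummy columns. If you train a model on a specific column layout, you need to apply the same column layout later, in production. Your model does not expect to see columns that did not exist during training, and likewise it does not expect columns to disappear because the data that produced them is no longer present.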
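As a side note to the comment above: the same fixed column layout can also be enforced after the encoding with DataFrame.reindex. This is a different technique from the dummy row used in the answer, shown here only as a sketch with hypothetical frames `train` and `new`:

```python
import pandas as pd

# Hypothetical training data and later-arriving data (illustrative values only)
train = pd.DataFrame({"categories": ["170,169", "448,104"]})
new = pd.DataFrame({"categories": ["170,999", "104"]})

# Remember the column layout produced on the training data
train_dummies = train["categories"].str.get_dummies(",").add_prefix("contains_")
train_columns = train_dummies.columns

# Encode the new data, then force it into the training layout:
# unknown columns (contains_999) are dropped, missing ones are filled with 0
new_dummies = (
    new["categories"]
    .str.get_dummies(",")
    .add_prefix("contains_")
    .reindex(columns=train_columns, fill_value=0)
)
print(new_dummies)
```

Here `train_columns` would have to be stored together with the model so that the same layout is available in production.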