Python: Combining heterogeneous features in scikit-learn


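I am doing binary classification on a set of documents whose features have already been extracted and are given in a text file. My problem is that there are both text features and numeric features, such as the year and a few others. A sample has the following format: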

label |title text |otherText text |numFeature1 number |numFeature2 number

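I am following the scikit-learn documentation on FeatureUnion, but its use case is a bit different. I am not extracting features from another feature, because the numeric features are already given.

Currently, I am using the following setup: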

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    # Split each raw line into its named fields
    ('features', Features()),

    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('otherText', Pipeline([
                ('selector', ItemSelector(key='otherText')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('numFeature1', Pipeline([
                ('selector', ItemSelector(key='numFeature1')),
            ])),
            ('numFeature2', Pipeline([
                ('selector', ItemSelector(key='numFeature2')),
            ])),
        ],
    )),
    ('classifier', MultinomialNB()),
])

I also adapted the Features class from the documentation:

import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Features(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('title', object), ('otherText', object),
                                      ('numFeature1', object), ('numFeature2', object)])

        for i, text in enumerate(posts):
            # l[0] is the label; l[1:] are the field values in file order
            l = re.split(r"\|\w+", text)
            features['title'][i] = l[1]
            features['otherText'][i] = l[2]
            features['numFeature1'][i] = l[3]
            features['numFeature2'][i] = l[4]

        return features
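
To see what the split in transform() produces, here is a minimal sketch run on one line in the format shown above. The sample values are invented for illustration and the variable names are my own.

```python
import re

# One made-up line in the "label |name value |name value ..." format
line = "1 |title some title |otherText more text |numFeature1 1999 |numFeature2 3"

# Splitting on "|<fieldname>" leaves the label first, then one piece per field
parts = re.split(r"\|\w+", line)
label = parts[0].strip()
fields = [p.strip() for p in parts[1:]]

# The numeric fields are still strings at this point
year = float(fields[2])
print(label, fields, year)
```

Note that the numeric fields come back as strings with surrounding whitespace, so they still need to be converted to numbers before a classifier can use them.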

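My question now is: how do I add the numeric features to the FeatureUnion? When I use CountVectorizer I get "ValueError: empty vocabulary; perhaps the documents only contain stop words", and using a DictVectorizer with only a single entry does not seem like a good approach to me.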

The TfidfVectorizer() object has not been fitted on the data yet.

Before building the pipeline, do the following:

vec = TfidfVectorizer()
vec.fit(data['free text column'])
pipeline = Pipeline([
    ('features', Features()),

    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', vec),
            ])),

            ... other features

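This also helps when you transform data for testing purposes, because for test data the pipeline automatically calls the transform() function of TfidfVectorizer rather than fit(), and fit() has to be called explicitly before the pipeline is built.

ItemSelector returns a one-dimensional [n,] array, selected according to the key passed to its constructor. FeatureUnion does not handle this kind of [n,] array correctly. FeatureUnion requires a 2-dimensional array from each of its inner transformers, where the first dimension (the number of samples) must be the same for all of them, so that the arrays can finally be stacked horizontally to combine the features.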
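The second step in your first two transformers (TfidfVectorizer()) takes this [n,] array from ItemSelector and outputs a valid [n, m] array, where m is the number of features extracted from the raw text.
But your third and fourth transformers contain only ItemSelector(), so they output an [n,] array. That is the cause of the error.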
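
The shape mismatch can be illustrated with plain NumPy arrays. This is only a sketch of the idea, not the code path FeatureUnion itself takes (for sparse blocks it stacks with scipy.sparse instead), and the array contents are made up.

```python
import numpy as np

n = 3
# Stands in for a TfidfVectorizer output of shape [n, m] with m = 2
text_block = np.ones((n, 2))
# Stands in for what ItemSelector returns for a numeric field: shape [n,]
numeric = np.array([1999.0, 2005.0, 2010.0])

# Stacking a 2-D block with a 1-D array side by side fails
try:
    np.hstack([text_block, numeric])
    stacked_1d_ok = True
except ValueError:
    stacked_1d_ok = False

# After reshaping to [n, 1] the two blocks line up row by row
column = numeric.reshape(-1, 1)
combined = np.hstack([text_block, column])
print(stacked_1d_ok, column.shape, combined.shape)
```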
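To correct this, you should reshape the output of ItemSelector to [n, 1].
Change the following code in ItemSelector.transform() (I am assuming you are using the ItemSelector code from the specified link):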
Original:
data_dict[self.key]

New:
data_dict[self.key].reshape((-1,1))

reshape() turns your [n,] array into an [n, 1] array, which FeatureUnion can then use to append the data correctly.
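
For reference, here is a minimal end-to-end sketch of a variant that leaves ItemSelector unchanged and applies the reshape in a separate step used only in the numeric branch, so the text branch still hands a 1-D array of strings to TfidfVectorizer. The ItemSelector below is assumed to behave like the one from the linked example, ToColumn is a hypothetical helper of my own, and the data is invented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Pick one field by key (assumed to match the linked ItemSelector)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data):
        return data[self.key]  # 1-D array of shape [n,]


class ToColumn(TransformerMixin, BaseEstimator):
    """Hypothetical helper: cast a 1-D field to float, reshape to [n, 1]."""

    def fit(self, X, y=None):
        # Marker attribute so scikit-learn treats this stateless step as fitted
        self.is_fitted_ = True
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float).reshape(-1, 1)


# Made-up records standing in for the output of Features()
rows = [("red apple", "1999"), ("green pear", "2005"),
        ("red cherry", "1998"), ("green lime", "2007")]
features = np.recarray(shape=(len(rows),),
                       dtype=[("title", object), ("numFeature1", object)])
for i, (title, num) in enumerate(rows):
    features["title"][i] = title
    features["numFeature1"][i] = num
labels = np.array([0, 1, 0, 1])

pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("title", Pipeline([
            ("selector", ItemSelector(key="title")),
            ("tfidf", TfidfVectorizer()),
        ])),
        ("numFeature1", Pipeline([
            ("selector", ItemSelector(key="numFeature1")),
            ("column", ToColumn()),  # [n,] -> [n, 1]
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

pipeline.fit(features, labels)
combined = pipeline.named_steps["union"].transform(features)
predictions = pipeline.predict(features)
print(combined.shape, predictions.shape)
```

The combined matrix has one column per TF-IDF term plus one column for the numeric field. MultinomialNB expects non-negative feature values, which holds for years; any scaling of the numeric column is left out of this sketch.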
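I fit the data after building the pipeline, using pipeline.fit(train.data, train.target), just as the documentation does. My question is what steps are needed to integrate the remaining non-text features into the pipeline. Simply using the ItemSelector() class with key='numFeature1' and 'numFeature2' returns ValueError: blocks[0,:] has incompatible row dimensions
Show the code for the whole pipeline.
I have edited my question accordingly.
Please take a look at my related question.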