Python: combining CountVectorizer and SelectKBest makes the labels disappear


python scikit-learn


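I have a class that builds a feature-extraction pipeline and fits a logistic regression model. The input is a set of string data held in a DataFrame. The ItemSelector class simply returns the column of cleaned data from the original DataFrame, which is then passed to CountVectorizer and a SelectKBest selector. If I remove SelectKBest, this pipeline works: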

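from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline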


class ItemSelector(BaseEstimator, TransformerMixin):
    # returns a single column from a DF
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class LogisticRegressionWithWordFeatures(object):

    def __init__(self):
        self.model = LogisticRegression()

    def fit(self, df, labels):
        self.pipeline = self.get_preprocessing_pipeline(df)
        fitted_df = self.pipeline.fit_transform(df)
        self.model.fit(fitted_df, labels)
        return self

    def predict(self, df):
        fitted_df = self.pipeline.transform(df)
        y = self.model.predict(fitted_df)
        return y


    def get_preprocessing_pipeline(self, data_frame):
        """
        Build the feature-extraction pipeline for the raw feature input DF.
        :param data_frame: input DF
        """

        process_and_join_features = Pipeline([
            ('features', FeatureUnion([
            ('count_lemma_features', Pipeline([
                ('selector', ItemSelector(key='clean_Invoice_Description')),
                ('counts', CountVectorizer(analyzer="word", stop_words='english'))]))])),
            ('reducer', SelectKBest(chi2, k=1000))
        ])
        return process_and_join_features

If I try to fit/transform with this pipeline, I get the following error:

    model = LogisticRegressionWithWordFeatures()
    model.fit(train_data, train_labels)
    test_y = model.predict(test_data)

>>>

    TypeError                                 Traceback (most recent call last)
<ipython-input-183-536a1c9c0a09> in <module>
      1 b_logistic_regression_with_hypers_bow_clean = LogisticRegressionWithWordFeatures()
----> 2 b_logistic_regression_with_hypers_bow_clean = b_logistic_regression_with_hypers_bow_clean.fit(b_ebay_train_data, b_ebay_train_labels)
      3 b_ebay_y_with_hypers_bow_clean = b_logistic_regression_with_hypers_bow_clean.predict(b_ebay_test_data)
      4 b_gold_y_with_hypers_bow_clean = b_logistic_regression_with_hypers_bow_clean.predict(gold_df)

<ipython-input-181-6974b6ea2a5b> in fit(self, df, labels)
      6     def fit(self, df, labels):
      7         self.pipeline = self.get_preprocessing_pipeline(df)
----> 8         fitted_df = self.pipeline.fit_transform(df)
      9         self.model.fit(fitted_df, labels)
     10         return self

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    391                 return Xt
    392             if hasattr(last_step, 'fit_transform'):
--> 393                 return last_step.fit_transform(Xt, y, **fit_params)
    394             else:
    395                 return last_step.fit(Xt, y, **fit_params).transform(Xt)

~/anaconda3/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    551         if y is None:
    552             # fit method of arity 1 (unsupervised transformation)
--> 553             return self.fit(X, **fit_params).transform(X)
    554         else:
    555             # fit method of arity 2 (supervised transformation)

TypeError: fit() missing 1 required positional argument: 'y'

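But this results in a KeyError for the label (the expense category), even though that column is in the training data.

If I do it step by step, it works: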

item_selector = ItemSelector(key='clean_Invoice_Description').fit(train_data)
count_selector = CountVectorizer(analyzer="word", stop_words='english')
k_best = SelectKBest(chi2, k=1000)

invoice_desc = item_selector.transform(train_data)
invoice_desc = count_selector.fit_transform(invoice_desc)
reduced_desc = k_best.fit_transform(invoice_desc, train_labels)
print(reduced_desc.shape)
>>> (6130, 1000)

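The problem with the step-by-step approach is that I want to use additional features from other columns, and a pipeline offers a nice way to combine them without doing it by hand.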
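The `TypeError` in the traceback can be reproduced with a much smaller pipeline. `SelectKBest` with `chi2` is a supervised selector, so it needs labels when it is fitted; calling `fit_transform` on the enclosing pipeline with only the features leaves `y` as `None`. The following is a minimal sketch, not the code from the question: the documents, the labels, and `k=2` are invented for illustration, and the exact exception raised without labels may differ between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
docs = ["red apple pie", "green apple tart", "blue steel bolt", "grey steel nut"]
labels = [0, 0, 1, 1]

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("reducer", SelectKBest(chi2, k=2)),
])

# Without labels the supervised selector has nothing to score against.
try:
    pipe.fit_transform(docs)
    failed_without_labels = False
except (TypeError, ValueError):
    # The exception type and message depend on the scikit-learn version.
    failed_without_labels = True

# With labels, the pipeline hands y to each step when fitting.
reduced = pipe.fit_transform(docs, labels)
print(failed_without_labels, reduced.shape)
```

This mirrors the step-by-step version above, where `k_best.fit_transform(invoice_desc, train_labels)` receives the labels explicitly.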

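Solved it. The main problem was the nesting of each feature. Pipeline() expects a list of tuples, where the first item in each tuple is the feature/pipeline name and the second is the actual transformer or pipeline object. As you add more features, it is easy to lose track of the nesting. Here is the final code: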

    def get_preprocessing_pipeline(self, data_frame):
        """
        Get data frame containing features and labels from raw feature input csv file"""

        process_and_join_features = Pipeline([
            ('features',
             FeatureUnion([
                ('tokens',
                    Pipeline([
                        ('selector', ItemSelector(key='clean_Invoice_Description')),
                        ('vec', CountVectorizer(analyzer="word", stop_words='english')),
                        ('dim_red', SelectKBest(chi2, k=5000))
                    ])),
                ('hypernyms',
                    Pipeline([
                        ('selector', ItemSelector(key='hypernyms_combined')),
                        ('vec', TfidfVectorizer(analyzer="word")),
                        ('dim_red', SelectKBest(chi2, k=5000))
                    ]))
             ]))
        ])
        return process_and_join_features
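The nested structure above can be tried end to end on toy data. The sketch below rests on several assumptions: a plain dict of lists stands in for the DataFrame (the `ItemSelector` only indexes by key), the text values and labels are invented, `k` is shrunk to fit the tiny vocabulary, and `text_branch` is a hypothetical helper that is not part of the original code. Because each branch ends in `SelectKBest`, the labels are passed to `fit_transform`; the wrapper's `fit` in the question calls `self.pipeline.fit_transform(df)` without them, so it would presumably need the same change.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Return a single column, as in the question."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


def text_branch(key, vectorizer, k):
    # Hypothetical helper: one (name, object) tuple per step, nothing deeper.
    return Pipeline([
        ("selector", ItemSelector(key=key)),
        ("vec", vectorizer),
        ("dim_red", SelectKBest(chi2, k=k)),
    ])


# Toy stand-in for the DataFrame; the values are invented for illustration.
data = {
    "clean_Invoice_Description": [
        "office paper ream", "printer paper box",
        "flight ticket london", "train ticket paris",
    ],
    "hypernyms_combined": [
        "stationery supply", "stationery supply",
        "travel transport", "travel transport",
    ],
}
labels = [0, 0, 1, 1]

features = FeatureUnion([
    ("tokens", text_branch(
        "clean_Invoice_Description",
        CountVectorizer(analyzer="word", stop_words="english"), k=3)),
    ("hypernyms", text_branch(
        "hypernyms_combined", TfidfVectorizer(analyzer="word"), k=2)),
])

# Labels are passed so that each SelectKBest step can score its features.
X = features.fit_transform(data, labels)
model = LogisticRegression().fit(X, labels)
predictions = model.predict(features.transform(data))
print(X.shape)
```

Building each branch through one small function keeps every `(name, object)` tuple at a single visible level, which is one way to avoid losing track of the brackets as more columns are added.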
