Python: Performing feature selection with pipelines and grid search


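As part of a research project, I want to select the best combination of preprocessing techniques and textual features to optimize the results of a text classification task. For this, I am using Python 3.6.

There are many ways to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and use grid search to test all the different (valid) possibilities, in order to arrive at the final feature combination.

My first step was to build a pipeline that looks like this: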

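# Run a vectorizer with a predefined tweet tokenizer and a Naive Bayes classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('nb', MultinomialNB())
])

# Compare no preprocessing against the predefined preprocessor function
parameters = {
    'vectorizer__preprocessor': (None, preprocessor)
}

gs = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)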

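In this simple example, the vectorizer tokenizes the data using tweet_tokenizer, and the grid search then tests which preprocessing option (None or the predefined function) gives better results.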
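A minimal, self-contained sketch of this first step, assuming toy data and a hypothetical lowercasing function in place of the real preprocessor. `lowercase=False` is set on the vectorizer because otherwise CountVectorizer lowercases the text itself when `preprocessor=None`, and the two options would be indistinguishable here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def simple_preprocessor(tweet):
    """Hypothetical stand-in for the real preprocessor: lowercase only."""
    return tweet.lower()


# Toy data (made up for illustration), 1 = positive, 0 = negative
texts = ["Good day", "great game", "GOOD vibes", "Great news",
         "bad day", "awful game", "BAD vibes", "Awful news"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(lowercase=False)),
    ("nb", MultinomialNB()),
])

# Two candidates: no preprocessing vs. the custom function
parameters = {"vectorizer__preprocessor": [None, simple_preprocessor]}

gs = GridSearchCV(pipeline, parameters, cv=2)
gs.fit(texts, labels)

print(gs.best_params_)               # the winning preprocessor option
print(gs.cv_results_["mean_test_score"])  # one score per candidate
```

On data this small the scores carry no meaning; the point is only that each candidate value of `vectorizer__preprocessor` is fitted and scored as a whole pipeline.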

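This seems like a decent start, but I am now struggling to find a way to test all the different possibilities within the preprocessor function, which is defined as follows: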

def preprocessor(tweet):
    # Data cleaning
    tweet = URL_remover(tweet) # Removing URLs
    tweet = mentions_remover(tweet) # Removing mentions
    tweet = email_remover(tweet) # Removing emails
    tweet = irrelev_chars_remover(tweet) # Removing invalid chars
    tweet = emojies_converter(tweet) # Translating emojies
    tweet = to_lowercase(tweet) # Converting words to lowercase
    # Others
    tweet = hashtag_decomposer(tweet) # Hashtag decomposition
    # Punctuation may only be removed after hashtag decomposition  
    # because it considers "#" as punctuation
    tweet = punct_remover(tweet) # Punctuation 
    return tweet
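The helper functions called above are not shown in the question. As a rough sketch only, assuming simple regex-based cleaning, three of them might look like the following; the patterns and the `demo_preprocessor` wrapper are illustrative assumptions, not the author's implementations:

```python
import re

# Assumed patterns, for illustration only
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def URL_remover(tweet):
    """Remove anything that looks like a URL."""
    return URL_PATTERN.sub("", tweet)


def mentions_remover(tweet):
    """Remove @username mentions."""
    return MENTION_PATTERN.sub("", tweet)


def to_lowercase(tweet):
    """Convert the tweet to lowercase."""
    return tweet.lower()


def demo_preprocessor(tweet):
    """Chain the three helpers and collapse the whitespace left behind."""
    tweet = URL_remover(tweet)
    tweet = mentions_remover(tweet)
    tweet = to_lowercase(tweet)
    return " ".join(tweet.split())


print(demo_preprocessor("Hey @bob see https://t.co/abc NOW"))
```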

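A "simple" solution for combining all the different processing techniques would be to create a separate function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.) and set the grid parameter as follows: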

parameters = {
   'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...)
}

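Although this would most likely work, it is not a viable or reasonable solution for this task, especially since there are 2^n_features different combinations and, consequently, as many functions.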
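To make the growth concrete, here is a small sketch that enumerates every subset of a handful of step names (the names are placeholders). With n optional steps there are 2^n subsets, so writing one function per subset quickly becomes impractical:

```python
from itertools import combinations

# Placeholder names for four optional preprocessing steps
steps = ["remove_urls", "remove_mentions", "remove_emails", "lowercase"]


def all_subsets(names):
    """Return every subset of `names`, including the empty one."""
    subsets = []
    for size in range(len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


subsets = all_subsets(steps)
print(len(subsets))  # 16, i.e. 2 ** 4
```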

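The final goal is to combine both preprocessing techniques and features in a pipeline, in order to optimize the classification results using grid search: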

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('feat_extractor', feat_extractor),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...),
    'feat_extractor': (None, func_A, func_B, func_C, ...)
}

Is there a simpler way to achieve this?

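This solution is only a rough sketch based on your description, and the specifics will depend on the type of data you use. Before building the pipeline, let's look at how CountVectorizer handles the raw_documents passed to it. Essentially, this is the line that turns each string document into tokens: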
return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

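The tokens are then simply counted and converted to a count matrix.
So what happens here is:
decode: only decides how to read the data from a file (if specified). It is of no use to us, because we already have the data in a list.
preprocess: does the following if 'strip_accents' and 'lowercase' are enabled in CountVectorizer, and nothing otherwise: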
strip_accents(x.lower())

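Again, this is of no use here, because we are moving the lowercasing into our own preprocessor and do not need to strip accents, since we already have the data as a list of strings.
tokenize: removes all punctuation, keeps only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (one element of the list).
Keep this in mind. If you want to handle punctuation and other symbols yourself (keeping some and removing others), you should also change CountVectorizer's default token_pattern='(?u)\b\w\w+\b'.

_word_ngrams: this method first removes the stop words (supplied as the parameter above) from the token list produced by the previous step, and then computes the n-grams according to CountVectorizer's ngram_range parameter. Keep this in mind as well if you want to handle n-grams your own way.

Note: if analyzer is set to 'char', the tokenizer step is not executed and the n-grams are built from characters.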
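The steps above can be inspected one at a time through CountVectorizer's `build_preprocessor`, `build_tokenizer` and `build_analyzer` methods. The sketch below uses a made-up document and stop-word list, and the second token pattern (which keeps a leading # or @) is only an assumed example of a custom `token_pattern`:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = "The #BestDay ever, at 9 a.m.!"

vec = CountVectorizer(ngram_range=(1, 2), stop_words=["the"])

# Step "preprocess": lowercasing only (strip_accents is left unset)
preprocess = vec.build_preprocessor()
cleaned = preprocess(doc)

# Step "tokenize": default pattern drops punctuation and 1-character tokens
tokenize = vec.build_tokenizer()
tokens = tokenize(cleaned)

# Full chain: preprocess -> tokenize -> remove stop words -> n-grams
analyze = vec.build_analyzer()
ngrams = analyze(doc)

print(cleaned)
print(tokens)
print(ngrams)

# Assumed custom pattern that keeps hashtags and mentions intact
custom = CountVectorizer(token_pattern=r"(?u)[#@]?\w+", lowercase=False)
custom_tokens = custom.build_tokenizer()("Hi @bob #fun!")
print(custom_tokens)
```

Note how the default pattern silently drops "#", "9", "a" and "m", which matters if hashtags or short tokens are meant to be features.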
Now, on to our pipeline. This is the structure I think can work here:
X --> combined_pipeline, Pipeline
            |
            |  Raw data is passed to Preprocessor
            |
            \/
         Preprocessor 
                 |
                 |  Cleaned data (still raw texts) is passed to FeatureUnion
                 |
                 \/
              FeatureUnion
                      |
                      |  Data is duplicated and passed to both parts
       _______________|__________________
      |                                  |
      |                                  |                         
      \/                                \/
   CountVectorizer                  FeatureExtractor
           |                                  |   
           |   Converts raw to                |   Extracts numerical features
           |   count-matrix                   |   from raw data
           \/________________________________\/
                             |
                             | FeatureUnion combines both the matrices
                             |
                             \/
                          Classifier
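To illustrate the branching in the diagram, here is a minimal sketch of a FeatureUnion that sends the same documents to a CountVectorizer and to a hypothetical one-feature extractor (`LengthExtractor`, not part of the original answer), then stacks both results column-wise:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extractor: one numeric feature, the text length."""

    def fit(self, raw_docs, y=None):
        return self  # nothing is learned from the data

    def transform(self, raw_docs):
        # One row per document, one column for the length
        return np.array([[len(doc)] for doc in raw_docs])


docs = ["good day", "bad day today"]

union = FeatureUnion([
    ("vectorizer", CountVectorizer()),
    ("length", LengthExtractor()),
])

combined = union.fit_transform(docs)

# 4 vocabulary columns (bad, day, good, today) plus 1 length column
print(combined.shape)
print(combined.toarray())
```

Because one branch is sparse, the combined matrix is sparse as well, which most scikit-learn classifiers accept directly.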

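Now for the code. This is what the pipeline looks like:
# Imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Pipeline: clean the raw tweets, build two feature blocks side by side, then classify
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])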

where CustomPreprocessor and CustomFeatureExtractor are defined as:
from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True, 
                 remove_emails=True, remove_invalid_chars=True, 
                 convert_emojis=True, lowercase=True, 
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls=remove_urls
        self.remove_mentions=remove_mentions
        self.remove_emails=remove_emails
        self.remove_invalid_chars=remove_invalid_chars
        self.convert_emojis=convert_emojis
        self.lowercase=lowercase
        self.decompose_hashtags=decompose_hashtags
        self.remove_punctuations=remove_punctuations

    # You Need to have all the functions ready
    # This method works on single tweets
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet) # Removing URLs

        if self.remove_mentions:
            tweet = mentions_remover(tweet) # Removing mentions

        if self.remove_emails:
            tweet = email_remover(tweet) # Removing emails

        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet) # Removing invalid chars

        if self.convert_emojis:
            tweet = emojies_converter(tweet) # Translating emojies

        if self.lowercase:
            tweet = to_lowercase(tweet) # Converting words to lowercase

        if self.decompose_hashtags:
            # Others
            tweet = hashtag_decomposer(tweet) # Hashtag decomposition

        # Punctuation may only be removed after hashtag decomposition  
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet) # Punctuation 

        return tweet

    def fit(self, raw_docs, y=None):
        # Noop - We dont learn anything about the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]

from textblob import TextBlob
import numpy as np
# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis=sentiment_analysis
        self.tweet_length=tweet_length

    # This method works on single tweets
    def extractor(self, tweet):
        features = []

        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)

        if self.tweet_length:
            features.append(len(tweet))

        # Do for other features you want.

        return np.array(features)

    def fit(self, raw_docs, y=None):
        # Noop - Again I am assuming that We dont learn anything about the data
        # Definitely not for tweet length, and also not for sentiment analysis
        # Or any other thing you might have here.
        return self

    def transform(self, raw_docs):
        # I am returning a numpy array so that the FeatureUnion can handle that correctly
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))
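The reason these classes work with grid search is that every option is a constructor argument stored under the same name, which lets `BaseEstimator` expose it through `get_params` and `set_params`. A reduced sketch with only two flags; `MiniPreprocessor` and its URL regex are assumptions for illustration:

```python
import re
from sklearn.base import BaseEstimator, TransformerMixin


class MiniPreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical two-flag version of CustomPreprocessor."""

    def __init__(self, remove_urls=True, lowercase=True):
        # Attribute names must match the argument names exactly
        self.remove_urls = remove_urls
        self.lowercase = lowercase

    def fit(self, raw_docs, y=None):
        return self  # nothing is learned from the data

    def transform(self, raw_docs):
        cleaned = []
        for tweet in raw_docs:
            if self.remove_urls:
                tweet = re.sub(r"https?://\S+\s*", "", tweet)
            if self.lowercase:
                tweet = tweet.lower()
            cleaned.append(tweet.strip())
        return cleaned


sample = ["Check https://example.com NOW"]

prep = MiniPreprocessor()
params = prep.get_params()          # what GridSearchCV can tune
default_out = prep.transform(sample)

prep.set_params(remove_urls=False)  # what GridSearchCV does per candidate
kept_url_out = prep.transform(sample)

print(params)
print(default_out)
print(kept_url_out)
```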

Finally, the parameter grid can now be written easily, like this:
param_grid = {'preprocessor__remove_urls': [True, False],
              'preprocessor__remove_mentions': [True, False],
              # ...
              # ...
              # No need to search for lowercase or preprocessor in CountVectorizer
              'features__vectorizer__max_df': [0.1, 0.2, 0.3],
              # ...
              # ...
              'features__extractor__sentiment_analysis': [True, False],
              'features__extractor__tweet_length': [True, False],
              # ...
              # ...
              'classifier__C': [0.01, 0.1, 1.0]
              }
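Before launching a search it can help to check how many candidates a grid contains, since every boolean flag doubles the count. A small sketch using `ParameterGrid` on a shortened version of the grid (only some of the keys, chosen for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "preprocessor__remove_urls": [True, False],
    "preprocessor__remove_mentions": [True, False],
    "features__vectorizer__max_df": [0.1, 0.2, 0.3],
    "features__extractor__sentiment_analysis": [True, False],
    "classifier__C": [0.01, 0.1, 1.0],
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)   # 2 * 2 * 3 * 2 * 3
first_candidate = list(grid)[0]

print(n_candidates)
print(first_candidate)
```

With 5-fold cross-validation each candidate is fitted five times, so the total number of pipeline fits is five times this count.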

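The code above avoids having "to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just list True and False for each option, and GridSearchCV will handle the rest.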
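An end-to-end sketch of this idea on toy data, using reduced, hypothetical versions of the two custom classes (`TogglePreprocessor` and `LengthExtractor` are assumptions, as are the data and the regex). One detail worth noting: CountVectorizer lowercases by default, so `lowercase=False` is passed here to let the preprocessor's own `lowercase` flag have an effect:

```python
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class TogglePreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical two-flag stand-in for CustomPreprocessor."""

    def __init__(self, remove_mentions=True, lowercase=True):
        self.remove_mentions = remove_mentions
        self.lowercase = lowercase

    def fit(self, raw_docs, y=None):
        return self

    def transform(self, raw_docs):
        cleaned = []
        for tweet in raw_docs:
            if self.remove_mentions:
                tweet = re.sub(r"@\w+", "", tweet)
            if self.lowercase:
                tweet = tweet.lower()
            cleaned.append(tweet)
        return cleaned


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical one-feature stand-in for CustomFeatureExtractor."""

    def fit(self, raw_docs, y=None):
        return self

    def transform(self, raw_docs):
        return np.array([[len(tweet)] for tweet in raw_docs], dtype=float)


# Toy data (made up), 1 = positive, 0 = negative
texts = ["Good day @anna", "GREAT game", "good vibes @bob", "Great news today",
         "Bad day @anna", "AWFUL game", "bad vibes @bob", "Awful news today"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("preprocessor", TogglePreprocessor()),
    ("features", FeatureUnion([
        ("vectorizer", CountVectorizer(lowercase=False)),
        ("extractor", LengthExtractor()),
    ])),
    ("classifier", SVC()),
])

param_grid = {
    "preprocessor__remove_mentions": [True, False],
    "preprocessor__lowercase": [True, False],
    "classifier__C": [0.1, 1.0],
}

gs = GridSearchCV(pipe, param_grid, cv=2)
gs.fit(texts, labels)

print(len(gs.cv_results_["params"]))  # 2 * 2 * 2 candidates
print(gs.best_params_)
```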
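Update:
If you do not want to use CountVectorizer, you can remove it from the pipeline and the parameter grid, and the new pipeline becomes: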
pipe = Pipeline([('preprocessor', CustomPreprocessor()), 
                 ("extractor", CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])

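Then make sure to implement all the functionality you want in CustomFeatureExtractor. If that becomes too complex, you can always write simpler extractors and combine them in a FeatureUnion in place of the CountVectorizer.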
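A sketch of that last suggestion, assuming two small hypothetical extractors (`LengthExtractor` and `ExclamationCounter`). Instead of a boolean flag per feature, each extractor is its own FeatureUnion branch, and a branch can be switched off by setting it to "drop", which is also a value a parameter grid can search over:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extractor: text length."""

    def fit(self, raw_docs, y=None):
        return self

    def transform(self, raw_docs):
        return np.array([[len(doc)] for doc in raw_docs])


class ExclamationCounter(BaseEstimator, TransformerMixin):
    """Hypothetical extractor: number of exclamation marks."""

    def fit(self, raw_docs, y=None):
        return self

    def transform(self, raw_docs):
        return np.array([[doc.count("!")] for doc in raw_docs])


docs = ["Wow!!", "ok"]

union = FeatureUnion([
    ("length", LengthExtractor()),
    ("exclaims", ExclamationCounter()),
])

both = union.fit_transform(docs)         # two columns: length, exclamations

union.set_params(exclaims="drop")        # disable one branch
only_length = union.fit_transform(docs)  # one column: length

print(both)
print(only_length)
```

Inside a pipeline whose union step is named 'features', the matching grid entry would be something like 'features__exclaims': ['drop', ExclamationCounter()].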
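What is ('feat_extractor', feat_extractor) supposed to do after CountVectorizer? The pipeline will pass the data through CountVectorizer and then pass the new data (a count matrix, not the words) to feat_extractor. Is this what you want? Or do you want to include feat_extractor inside the preprocessor, as you mentioned?

@VivekKumar feat_extractor should act only on the raw text. I know that here it would receive the output of CountVectorizer, but this was just a clumsy attempt to show what I am trying to do. CountVectorizer only allows a specific set of features (e.g. n-grams), and I want to perform further feature extraction (e.g. sentiment analysis).

If you want to use both (vectorizer and feat_extractor) on the same data, then FeatureUnion can help. Now about func_A, func_B, etc.: are they the same for both the vectorizer preprocessor and feat_extractor?

No, they should be different functions. For the vectorizer preprocessor, funcA would combine preprocessing steps 1 and 2 (e.g. lowercase + emoji converter), while func_A of feat_extractor would combine features 1 and 2 or any other possible combination (e.g. n-grams + sentiment analysis + tweet length). The preprocessing functions will only contain combinations of data-cleaning functions, while the feature extraction functions will enable ...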