Python: Adding numbers to the stop words of scikit-learn's CountVectorizer

python scikit-learn

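A related question explains how to add your own words to the built-in English stop words of CountVectorizer. I am interested in seeing what effect eliminating all numbers as tokens has on a classifier.

ENGLISH_STOP_WORDS is stored as a frozenset, so I guess my question boils down to this (unless there is a method I don't know about): is it possible to add an arbitrary representation of numbers to a frozen list?

My feeling is that it is not possible, because the list you have to pass is finite and the set of possible numbers is not.

I suppose one way to accomplish the same thing would be to loop through the test corpus and collect the words for which word.isdigit() is True into a set/list, which I could then union with ENGLISH_STOP_WORDS, but I would rather be lazy and pass something simpler to the stop_words parameter.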
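
For reference, the workaround described in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample corpus is borrowed from the answers, the names token_re and digit_tokens are made up for this example, and the regular expression is assumed to mirror CountVectorizer's default token_pattern.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

corpus = ['This is sentence.', 'This is a second sentence.',
          '12 dogs eat candy', '1 2 3 45']

# Assumed to mirror CountVectorizer's default token_pattern
# (tokens of two or more word characters).
token_re = re.compile(r'(?u)\b\w\w+\b')

# Collect every purely numeric token that occurs in the corpus.
digit_tokens = {tok
                for doc in corpus
                for tok in token_re.findall(doc.lower())
                if tok.isdigit()}

# frozenset.union returns a new frozenset; ENGLISH_STOP_WORDS itself is unchanged.
stop_words = ENGLISH_STOP_WORDS.union(digit_tokens)

# Recent scikit-learn versions expect stop_words to be a list, so convert it.
cv = CountVectorizer(stop_words=list(stop_words))
cv.fit(corpus)
vocab = sorted(cv.vocabulary_)
print(vocab)
```

This only covers numbers that actually occur in the corpus it was built from, which is the finiteness problem raised in the question.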

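You can implement this as a custom preprocessor for CountVectorizer instead of extending the stop word list. Because a custom preprocessor replaces the built-in lowercasing step, the lambda calls x.lower() itself. Here is a simple version, shown in bpython: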
>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))
>>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> cv.vocabulary_
{u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}

Precompiling the regexp might give some speed-up on a large number of samples.
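
A minimal sketch of what precompiling could look like is shown below. The names NUMBER_RE and replace_numbers are hypothetical, and the pattern (a digit followed by any run of digits or dots) is an assumption chosen for this example rather than something prescribed by the answer.

```python
import re

# Compile the pattern once, instead of passing the pattern string on every call.
# Assumed pattern: a digit followed by any run of digits or dots.
NUMBER_RE = re.compile(r'\d[\d.]*')

def replace_numbers(text):
    """Lowercase one raw document and replace every number with the token 'NUM'."""
    return NUMBER_RE.sub('NUM', text.lower())

print(replace_numbers('12 dogs eat candy'))
print(replace_numbers('1 2 3 45'))
```

The function can then be passed as preprocessor=replace_numbers to CountVectorizer. Python's re module already caches recently used patterns, so the gain may be modest; it is worth measuring on your own data.
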
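import re
from sklearn.feature_extraction.text import CountVectorizer

list_of_texts = ['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45']

def no_number_preprocessor(text):
    # The preprocessor receives each raw document as one string, not a list of tokens.
    # Lowercase here, because a custom preprocessor replaces the built-in lowercasing.
    r = re.sub(r'\d+', 'NUM', text.lower())
    # This alternative just removes numbers:
    # r = re.sub(r'\d+', '', text.lower())
    return r

# Show what the preprocessor does to each document.
for t in list_of_texts:
    no_num_t = no_number_preprocessor(t)
    print(no_num_t)

cv = CountVectorizer(input='content', preprocessor=no_number_preprocessor)
dtm = cv.fit_transform(list_of_texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+.
cv_vocab = cv.get_feature_names_out().tolist()

print(cv_vocab)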

Output:
this is sentence.
this is a second sentence.
NUM dogs eat candy
NUM NUM NUM NUM
['NUM', 'candy', 'dogs', 'eat', 'is', 'second', 'sentence', 'this']
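
If the goal is to drop numbers entirely rather than keep a single NUM feature, the removal alternative mentioned in the code comment can be sketched as follows. The function name strip_numbers is hypothetical, and the sample texts are the ones used above.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

texts = ['This is sentence.', 'This is a second sentence.',
         '12 dogs eat candy', '1 2 3 45']

def strip_numbers(text):
    # Lowercase manually: a custom preprocessor replaces the built-in lowercasing.
    # Every run of digits is deleted instead of being replaced by a placeholder.
    return re.sub(r'\d+', '', text.lower())

cv = CountVectorizer(preprocessor=strip_numbers)
cv.fit(texts)
vocab = sorted(cv.vocabulary_)
print(vocab)
```

With this variant no NUM entry appears in the vocabulary. One caveat: digits embedded in a word are deleted too, so a token such as abc123def would collapse to abcdef.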