Python: Why doesn't text feature extraction return all possible feature names?

python scikit-learn nlp pytorch

This is about a code snippet from the book. The value of vocab is:

vocab = ['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']

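Why is there no 'a' among the extracted feature names? If it is automatically excluded as too common a word, why isn't "an" excluded for the same reason? How can I make .get_feature_names() filter other words as well?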

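Very good question! Though this isn't really a pytorch question but a sklearn one =)

I'd encourage you to first go through this, especially the "Vectorization with sklearn" section.

TL;DR: if we use CountVectorizer: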

from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Create the vectorizer
    count_vect = CountVectorizer()
    count_vect.fit_transform(fin)

# We can check the vocabulary in our vectorizer
# It's a dictionary where the words are the keys and 
# The values are the IDs given to each word. 
print(count_vect.vocabulary_)

[out]:

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumps': 3,
 'lazy': 4,
 'mr': 5,
 'over': 6,
 'quick': 7,
 'the': 8}

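We didn't tell the vectorizer to remove punctuation, tokenize, or lowercase the text, so how did it do all that?

Also, "the" is in the vocabulary. It's a stop word, and we want it gone... And "jumps" isn't stemmed or lemmatized!

If we look at the documentation of CountVectorizer in sklearn, we see: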

CountVectorizer(
    input='content', encoding='utf-8',
    decode_error='strict', strip_accents=None,
    lowercase=True, preprocessor=None,
    tokenizer=None, stop_words=None,
    token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1),
    analyzer='word', max_df=1.0, min_df=1,
    max_features=None, vocabulary=None,
    binary=False, dtype=<class 'numpy.int64'>)
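To see how the defaults in this signature already account for the vocabulary printed earlier, here is a rough plain-Python imitation of what they do: lowercase the text, pull out tokens with the default token_pattern, and number the sorted unique tokens. The helper name build_vocabulary is made up for this sketch, and it only approximates the behavior; it is not sklearn's actual implementation.

```python
import re

# Default token_pattern from the signature: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def build_vocabulary(documents, token_pattern=DEFAULT_TOKEN_PATTERN, lowercase=True):
    """Hypothetical helper that imitates the default word-level analysis."""
    tokens = set()
    for doc in documents:
        if lowercase:
            doc = doc.lower()
        # Punctuation never matches \w, so it acts purely as a separator.
        tokens.update(re.findall(token_pattern, doc))
    # IDs are assigned in alphabetical order of the tokens.
    return {token: idx for idx, token in enumerate(sorted(tokens))}


sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

vocabulary = build_vocabulary([sent1, sent2])
print(vocabulary)
```

Nothing in this sketch removes stop words or stems anything, which is consistent with "the" and "jumps" surviving in the vocabulary.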

They are all gone again.

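Now if we dig deeper into the documentation:

token_pattern : string. Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Aha, that's why all the single-character tokens were dropped!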
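A quick way to check this is to run the default pattern directly with Python's re module, next to a relaxed pattern that drops the second \w. The sentence is the second one from the corpus used in this thread. This only exercises the regular expressions, not CountVectorizer itself.

```python
import re

corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

default_pattern = r"(?u)\b\w\w+\b"  # at least two word characters
relaxed_pattern = r"(?u)\b\w+\b"    # at least one word character

# Lowercase first, as the vectorizer does when lowercase=True.
text = corpus[1].lower()

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # ['fruit', 'flies', 'like', 'banana']
print(relaxed_tokens)  # ['fruit', 'flies', 'like', 'a', 'banana', 'x', 'y', 'z']
```

So 'a' is not being treated as "too common"; it is simply too short to match the default pattern, while 'an' has two characters and passes.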

The default pattern of CountVectorizer is token_pattern=r"(?u)\b\w\w+\b". To make it accept single characters, you can try:

>>> one_hot_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")           
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
        vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['1', '2', '3', 'a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time', 'x', 'y', 'z']
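On recent scikit-learn releases, get_feature_names() is no longer available (as far as I know it was deprecated in 1.0 and removed in 1.2 in favor of get_feature_names_out()), so the session above may fail with an AttributeError on a current install. Here is a minimal sketch that should work with either version, assuming scikit-learn is installed and using the corpus from this thread:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

# Relaxed pattern: accept tokens of one or more word characters.
one_hot_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
one_hot_vectorizer.fit(corpus)

# Prefer the newer method when it exists, fall back to the older one otherwise.
if hasattr(one_hot_vectorizer, "get_feature_names_out"):
    feature_names = [str(name) for name in one_hot_vectorizer.get_feature_names_out()]
else:
    feature_names = list(one_hot_vectorizer.get_feature_names())

print(feature_names)
```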


Thank you, Liling. Great answer!

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> one_hot_vectorizer = CountVectorizer(stop_words='english')

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

>>> one_hot_vectorizer.get_feature_names()
['arrow', 'banana', 'flies', 'fruit', 'like', 'time']
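The built-in 'english' list is one option. stop_words also accepts your own list of words, which is a direct way to filter other specific words out of the feature names. Here is a small sketch, assuming scikit-learn is installed. The stop list is made up purely for illustration, and the feature names are read from vocabulary_ so the code does not depend on which get_feature_names variant your version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

# Illustrative custom stop list (an assumption, not taken from the book).
custom_stop_words = ['an', 'like']

vectorizer = CountVectorizer(stop_words=custom_stop_words)
vectorizer.fit(corpus)

# vocabulary_ maps each kept token to its column index;
# sorting the keys gives the feature names in column order.
filtered_names = sorted(vectorizer.vocabulary_)
print(filtered_names)
```

The single-character tokens are still absent here because the default token_pattern is unchanged; the custom list only removes 'an' and 'like' on top of that.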

>>> corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

>>> one_hot_vectorizer = CountVectorizer()

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()                                         
['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']
