Python: Converting a list of words to a list of integers in scikit-learn

python nlp scikit-learn

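I want to convert a list of words to a list of integers in scikit-learn, and do so for a corpus that consists of lists of words. The corpus can be a set of sentences.

I can do it as follows, but is there an easier way? I suspect I may be missing some CountVectorizer functionality, since this is a common preprocessing step in natural language processing. In this code I first fit the CountVectorizer, then I have to iterate over each word of each list of words to generate the list of integers.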

import sklearn
import sklearn.feature_extraction
import numpy as np

def reverse_dictionary(mapping):
    '''
    Swap keys and values. Named "mapping" to avoid shadowing the builtin dict.
    http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
    '''
    return {v: k for k, v in mapping.items()}

vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

X = vectorizer.fit_transform(corpus).toarray()

tokenizer = vectorizer.build_tokenizer()
output_corpus = []
for line in corpus:
    # The tokenizer does not lowercase, so do it explicitly
    line = tokenizer(line.lower())
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    output_line = np.empty_like(line, dtype=int)
    for token_number, token in np.ndenumerate(line):
        output_line[token_number] = vectorizer.vocabulary_.get(token)
    output_corpus.append(output_line)
print('output_corpus: {0}'.format(output_corpus))

word2idx = vectorizer.vocabulary_
print('word2idx: {0}'.format(word2idx))

idx2word = reverse_dictionary(word2idx)
print('idx2word: {0}'.format(idx2word))

Output:

output_corpus: [array([9, 3, 7, 2, 1]), # 'This is the first document.'
                array([9, 3, 7, 6, 6, 1]), # 'This is the second second document.'
                array([0, 7, 8, 4]), # 'And the third one.'
                array([3, 9, 7, 2, 1, 9, 3, 5])] # 'Is this the first document? This is right.'
word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4,
           u'second': 6, u'the': 7, u'document': 1, u'first': 2}
idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right', 
           6: u'second', 7: u'the', 8: u'third', 9: u'this'}

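I don't know whether there is a more direct way, but you can simplify the syntax by using map instead of a for loop to iterate over each word.

You can also use build_analyzer(), which handles both preprocessing and tokenization, so there is no need to call lower() explicitly.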
analyzer = vectorizer.build_analyzer()
output_corpus = [map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line)) for line in corpus]
# For Python 3.x it should be
# [list(map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line))) for line in corpus]

Output (output_corpus):

[[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]

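Edit

Thanks to @user3914041: simply using a list comprehension is probably preferable in this case. It avoids the lambda, so it can be slightly faster than map. (According to a linked source and my own simple tests.)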
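
The list-comprehension variant mentioned in the edit can be sketched end to end as follows. This assumes Python 3 with scikit-learn installed and reuses the question's corpus. Skipping tokens that are missing from the vocabulary is a choice added here for illustration; it is not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document? This is right.',
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)  # learns vocabulary_ (word -> column index)

# build_analyzer() returns a callable that lowercases and tokenizes a string
analyzer = vectorizer.build_analyzer()
word2idx = vectorizer.vocabulary_

# Added assumption: tokens not seen during fit are skipped instead of
# producing None, which is what vocabulary_.get() would return for them.
output_corpus = [
    [word2idx[token] for token in analyzer(line) if token in word2idx]
    for line in corpus
]
print(output_corpus)
```

The membership check only matters when the same mapping is applied to text that was not part of the fitted corpus.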

I often use a Counter in Python to solve this problem, for example:

from collections import Counter

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

# Join the sentences into one string, then split on whitespace
as_one = ''
for sentence in corpus:
    as_one = as_one + ' ' + sentence

words = as_one.split()

# Count occurrences of each whitespace-separated word
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

print(vocab_to_int)

Output:

{'the': 1, 'This': 2, 'is': 3, 'first': 4, 'document.': 5, 'second': 6, 'And': 7, 'third': 8, 'one.': 9, 'Is': 10, 'this': 11, 'document?': 12, 'right.': 13}
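
To get from this vocabulary to the integer lists the question asks for, one more lookup step is needed. The sketch below is one possible way to do it. The regular-expression tokenizer and the lowercasing are assumptions added here (so that 'document.' and 'document?' share one id), and tokenize and encode are hypothetical helper names.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document? This is right.',
]

def tokenize(sentence):
    # Assumption: lowercase and keep only runs of word characters
    return re.findall(r"\w+", sentence.lower())

tokenized = [tokenize(sentence) for sentence in corpus]
counts = Counter(word for tokens in tokenized for word in tokens)

# Most frequent word gets id 1; ties keep first-seen order (Python 3.7+)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

def encode(tokens):
    # Map each token to its integer id
    return [vocab_to_int[word] for word in tokens]

encoded = [encode(tokens) for tokens in tokenized]
print(encoded)
```

Starting the ids at 1 leaves 0 free, for example as a padding value.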

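For a given text, CountVectorizer is meant to return a vector containing the count of each word.

For example, for the corpus corpus = ["the cat", "the dog"], the vectorizer will find 3 distinct words, so it will output vectors of dimension 3, where, say, "the" corresponds to the first dimension, "cat" to the second and "dog" to the third. For instance, "the cat" will be converted to [1, 1, 0], "the dog" to [1, 0, 1], and sentences with repeated words will have larger values (e.g. "the cat cat" → [1, 2, 0]).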
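
The count-vector idea above can be illustrated in plain Python. This is only a sketch of the concept with a hand-written vocabulary order (the, cat, dog) chosen to match the example. It is not how CountVectorizer is implemented, and CountVectorizer itself assigns columns in alphabetical order (cat, dog, the). count_vector is a hypothetical helper.

```python
from collections import Counter

# Hand-picked column order, matching the example in the text
vocabulary = ["the", "cat", "dog"]

def count_vector(sentence, vocabulary):
    # Count whitespace-separated, lowercased words
    counts = Counter(sentence.lower().split())
    # One entry per vocabulary word; a Counter returns 0 for missing words
    return [counts[word] for word in vocabulary]

print(count_vector("the cat", vocabulary))      # [1, 1, 0]
print(count_vector("the dog", vocabulary))      # [1, 0, 1]
print(count_vector("the cat cat", vocabulary))  # [1, 2, 0]
```

Such a vector discards word order, which is why it differs from the integer sequences the question asks for.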

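For what you want to do, you will have a good time with the zeugma package. You only need to do the following (after running pip install zeugma in a terminal):

>>> from zeugma import TextsToSequences
>>> sequencer = TextsToSequences()
>>> sequencer.fit_transform(["this is a sentence.", "and another one."])
array([[1, 2, 3, 4], [5, 6, 7]], dtype=object)

You can always access the index-to-word mapping with:

>>> sequencer.index_word
{1: 'this', 2: 'is', 3: 'a', 4: 'sentence', 5: 'and', 6: 'another', 7: 'one'}

From there, you can transform any new sentence with this mapping:

>>> sequencer.transform(["a sentence"])
array([[3, 4]])

I hope it helps!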
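
For readers who would rather not add a dependency, the fit/transform behaviour shown above can be approximated in a few lines. The class below is a hypothetical stand-in written for illustration, not zeugma's implementation. It assumes ids are assigned from 1 in order of first appearance, tokenizes with a simple regular expression, and skips unknown words.

```python
import re

class SimpleSequencer:
    """Hypothetical minimal word <-> integer mapper (not zeugma's code)."""

    def __init__(self):
        self.word_index = {}   # word -> id
        self.index_word = {}   # id -> word

    @staticmethod
    def _tokenize(text):
        # Assumption: lowercase and keep only runs of word characters
        return re.findall(r"\w+", text.lower())

    def fit(self, texts):
        # Assign ids starting at 1, in order of first appearance
        for text in texts:
            for word in self._tokenize(text):
                if word not in self.word_index:
                    idx = len(self.word_index) + 1
                    self.word_index[word] = idx
                    self.index_word[idx] = word
        return self

    def transform(self, texts):
        # Words not seen during fit are skipped
        return [
            [self.word_index[w] for w in self._tokenize(t) if w in self.word_index]
            for t in texts
        ]

    def inverse_transform(self, sequences):
        # Map ids back to words
        return [[self.index_word[i] for i in seq] for seq in sequences]

sequencer = SimpleSequencer().fit(["this is a sentence.", "and another one."])
print(sequencer.transform(["a sentence"]))
print(sequencer.inverse_transform([[3, 4]]))
```

Keeping both directions of the mapping makes it easy to decode integer sequences back into words for inspection.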

I don't think it gets much better than this; I don't know of a way to do it with CountVectorizer directly. I do prefer the list comprehension syntax, though:

[[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]

@user3914041 Hmm... I agree that a list comprehension is preferable in this case.

A better way to convert the list to a string might be: as_one = ' '.join(corpus)