How to vectorize labeled bigrams with scikit-learn?


python machine-learning nlp scikit-learn


I am teaching myself scikit-learn and decided to start with my own corpus. I built some bigrams by hand, for example:

training_data = [[('this', 'is'), ('is', 'a'),('a', 'text'), 'POS'],
[('and', 'one'), ('one', 'more'), 'NEG']
[('and', 'other'), ('one', 'more'), 'NEU']]

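I would like to vectorize them into a format that can be fed to some of the classification algorithms scikit-learn provides (SVC, multinomial naive Bayes, etc.). This is what I tried: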

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word')

X = count_vect.transform(((' '.join(x) for x in sample)
                  for sample in training_data))

print X.toarray()

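The problem is that I don't know how to deal with the labels (i.e. 'POS', 'NEG', 'NEU'). Do I need to "vectorize" the labels too in order to pass training_data to a classification algorithm, or can I leave them as 'POS' or any other kind of string? Another problem is that I get this: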

raise ValueError("Vocabulary wasn't fitted or is empty!")
ValueError: Vocabulary wasn't fitted or is empty!

So, how can I vectorize bigrams like the ones in training_data? I have also read about other approaches; do you think using them would be better for this task?

It should look like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
...                  [('and', 'one'), ('one', 'more'), 'NEG'],
...                  [('and', 'other'), ('one', 'more'), 'NEU']]
>>> # The documents are already lists of bigrams, so skip preprocessing and tokenizing.
>>> count_vect = CountVectorizer(preprocessor=lambda x: x,
...                              tokenizer=lambda x: x)
>>> # Drop the trailing label from each document before vectorizing.
>>> X = count_vect.fit_transform(doc[:-1] for doc in training_data)
>>> print(count_vect.vocabulary_)
{('and', 'one'): 1, ('a', 'text'): 0, ('is', 'a'): 3, ('and', 'other'): 2, ('this', 'is'): 5, ('one', 'more'): 4}
>>> print(X.toarray())
[[1 0 0 1 0 1]
 [0 1 0 0 1 0]
 [0 0 1 0 1 0]]
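As for the error in the question: it comes from calling transform on a vectorizer that has never been fitted, so there is no vocabulary to count against. A minimal sketch of the difference, assuming a reasonably recent scikit-learn (where the exception is a NotFittedError, which is a ValueError subclass); the helper name identity is my own and not part of the library:

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(doc):
    # Hypothetical helper: hand the pre-tokenized document back unchanged.
    return doc


docs = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
        [('and', 'one'), ('one', 'more')]]

vect = CountVectorizer(preprocessor=identity, tokenizer=identity)

# transform() before fit(): there is no vocabulary yet, so this raises.
try:
    vect.transform(docs)
    unfitted_error = None
except ValueError as exc:
    unfitted_error = exc

# fit_transform() learns the vocabulary and counts in one step.
X_first = vect.fit_transform(docs)

# After fitting, transform() alone works and reuses the learned vocabulary.
X_second = vect.transform(docs)
```

So fit_transform (or fit followed by transform) is the call to use on the training data; plain transform is for new data afterwards.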

Then put the labels into a target variable:

y = [doc[-1] for doc in training_data] # ['POS', 'NEG', 'NEU']
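On whether the labels need to be "vectorized" too: scikit-learn classifiers accept string class labels directly, so y can be passed as it is. If integer codes are wanted anyway, LabelEncoder is one option. A small sketch, not a required step:

```python
from sklearn.preprocessing import LabelEncoder

training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
                 [('and', 'one'), ('one', 'more'), 'NEG'],
                 [('and', 'other'), ('one', 'more'), 'NEU']]

# The label is the last element of each document.
y = [doc[-1] for doc in training_data]

# Optional: map the string labels to integer codes.
encoder = LabelEncoder()
y_codes = encoder.fit_transform(y)

# encoder.classes_ holds the sorted label names; inverse_transform maps codes back.
y_back = list(encoder.inverse_transform(y_codes))
```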

Now you can train a model:

from sklearn.svm import SVC

model = SVC()
model.fit(X, y)
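To make the role of the labels concrete: fit learns a mapping from the rows of X to the entries of y, and predict returns labels of the same kind for new rows. A rough end-to-end sketch follows. I use MultinomialNB (mentioned in the question) rather than SVC purely for illustration, and new_doc is a made-up example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
                 [('and', 'one'), ('one', 'more'), 'NEG'],
                 [('and', 'other'), ('one', 'more'), 'NEU']]

# Split each document into its bigrams (features) and its label (target).
docs = [doc[:-1] for doc in training_data]
y = [doc[-1] for doc in training_data]

# The documents are already tokenized, so pass them through untouched.
count_vect = CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)
X = count_vect.fit_transform(docs)

model = MultinomialNB()
model.fit(X, y)  # string labels are accepted as they are

# New data must go through the SAME fitted vectorizer, using transform().
new_doc = [('this', 'is'), ('a', 'text')]  # made-up unseen document
X_new = count_vect.transform([new_doc])
predicted = model.predict(X_new)
```

Here predicted is an array of label strings, one per new document, drawn from model.classes_.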

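Actually, that is how I have been setting the labels. The problem is that I have a much bigger list of bigrams, and it is not clear to me how scikit-learn uses the labels to train and predict. Is there a more Pythonic way to set the labels, rather than setting them line by line? Thanks.

Yes, I updated my answer, and also fixed the CountVectorizer call so that it doesn't preprocess or tokenize your bigrams. There are a few minor bugs in your code; I suggest you open a new question about the errors you are getting now and the ones you are about to run into (hint: compare your code with mine).

If you mean "how do I extract the last element from each list", that construct is called a list comprehension in Python. It is equivalent to y = []; for sublist in data: y.append(sublist[-1]), where sublist[-1] means "the last element of sublist".

Well, CountVectorizer can take raw text and output bigrams itself; look up the ngram_range parameter.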
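For that last suggestion, a small sketch of letting CountVectorizer build the bigrams from raw text with ngram_range=(2, 2). The sentences and labels are made up for illustration. Note that the default token_pattern drops one-character tokens such as "a", so I widen it here; that choice is my assumption about what is wanted, not something the comment specifies:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up raw sentences with their labels kept in a separate list.
texts = ['this is a text', 'and one more']
labels = ['POS', 'NEG']

vect = CountVectorizer(
    analyzer='word',
    ngram_range=(2, 2),           # bigrams only, no unigrams
    token_pattern=r'(?u)\b\w+\b'  # keep one-character words such as "a"
)
X = vect.fit_transform(texts)

# Each feature is now a space-joined bigram string.
bigrams = sorted(vect.vocabulary_)
```

With this approach the bigrams are strings such as 'this is' rather than tuples, and there is no need to build them by hand first.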