Vocabulary fitting problem in Python scikit-learn?


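I have a directory full of .txt files (documents). First, I load the documents, strip some parentheses, and remove some quotes, so the documents look like this, for example: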

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to create a training matrix, like this:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

This is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

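How can I create the vector representation in this case? I thought the loaded files were being carried in `documents`, but it seems the vectorizer cannot be fit on it.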

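What is the content of `documents`? It should be a list of filenames or a list of strings containing the tokens. Also, fit_transform should be called on the object rather than like a static method, i.e. `vectorizer.fit_transform(documents)`.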

For example, this works here:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
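
To make the expected input and output shapes concrete, here is a minimal sketch built on the same two toy documents. The small `n_features=2**10` value is an assumption chosen only to keep the matrix compact; the default is much larger. Because HashingVectorizer is stateless, calling `transform` on the instance should give the same result as `fit_transform`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# A list with one string per document, as in the example above.
documents = ['this is a test', 'another test']

# n_features is set to a small value here only for illustration (an assumption).
vectorizer = HashingVectorizer(analyzer='word', n_features=2**10)

# Call the method on the instance, not on the class.
X = vectorizer.fit_transform(documents)

# One row per document, one column per hash bucket.
print(X.shape)
print(type(X))
```

The result is a SciPy sparse matrix; `X.toarray()` converts it to a dense array, which is reasonable at this size but can be very large with the default number of features.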

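Thanks for the feedback. When I print `documents` I get the following: [a very large text][another very large text][a third very large text], that is, three lists representing the 3 .txt files in my directory. Any suggestions here?

Yes, your `documents` should be a list in which each element is a string holding one tokenized document. Something like: `documents = ['word_1_doc_1 word_2_doc_1', 'word_1_doc_2 word_2_doc_2 ...', 'word_1_doc_3 word_2_doc_3 ...']`. If you do something like `documents = [''.join(ii) for ii in documents]`, it might work.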
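
The difference between the two data shapes can be sketched in plain Python. The nested list below is a hypothetical stand-in for `preprocessDocuments` (the real output of `load` is not shown in the question), and joining with a space rather than an empty string is an assumption made so that words at fragment boundaries do not fuse together.

```python
# Hypothetical stand-in for preprocessDocuments: one inner list per .txt file,
# each holding fragments of text from that file.
preprocessed = [
    ["is a scientific discipline", "that explores the construction"],
    ["Machine learning can be considered", "a subfield of computer science"],
]

# The approach from the question: str() each inner list, concatenate everything,
# then drop quotes, commas, and parentheses. The result is ONE string.
flattened = "".join(
    ch for ch in "".join(str(v) for v in preprocessed) if ch not in "',()"
)
print(type(flattened).__name__)  # a single str, not a list of documents
print(flattened)

# The suggested shape: a list with one string per file.
documents = [" ".join(fragments) for fragments in preprocessed]
print(len(documents))
print(documents)
```

The flattened value still carries the square brackets of each inner list, which matches the `[...][...][...]` output described in the comment above. The list version is the shape that `vectorizer.fit_transform(documents)` expects.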