Python: natural language processing with TfidfVectorizer

I'm reading strings from the file train1.txt, but executing the statement tfidf.fit(dataset) raises an error that I haven't been able to fix. Any help is appreciated.

My code:

from sklearn.feature_extraction.text import TfidfVectorizer
filename='train1.txt'
dataset=[]
with open(filename) as f:
    for line in f:
        dataset.append([str(n) for n in line.strip().split(',')])
print (dataset)
tfidf=TfidfVectorizer()
tfidf.fit(dataset)
dict1=tfidf.vocabulary_
print 'Using tfidfVectorizer'
for key in dict1.keys():
    print key+" "+ str(dict1[key])
Error log:

Traceback (most recent call last):
  File "Q1.py", line 52, in <module>
    tfidf.fit(dataset)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1361, in fit
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
根据for TfidfVectorizer,fit函数期望“生成str、unicode或文件对象的iterable”作为其第一个参数。您提供的是一个列表列表,它不满足此要求

您已经使用
split
方法将每一行转换为字符串列表,因此您要么需要重新加入字符串,要么完全避免拆分它。当然,这取决于您的输入格式
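To see the contrast, here is a minimal sketch (a toy two-document corpus, not the asker's data; assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# fit() wants an iterable of strings, one string per document.
docs = ["hello world", "hello there"]            # OK: list of str
bad = [["hello", "world"], ["hello", "there"]]   # not OK: list of lists

tfidf = TfidfVectorizer()
tfidf.fit(docs)                    # works
print(sorted(tfidf.vocabulary_))   # ['hello', 'there', 'world']

try:
    TfidfVectorizer().fit(bad)     # the analyzer calls doc.lower() on each item
except AttributeError as e:
    print(e)                       # 'list' object has no attribute 'lower'
```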

It should work if you modify the line.

Depending on your input format, you may want to replace

dataset.append([str(n) for n in line.strip().split(',')])

with

dataset.append(" ".join([str(n) for n in line.strip().split(',')]))

or simply

dataset.append(line.strip().replace(",", " "))

(I can only guess at how the ',' is used in your input text.)
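Putting the fix together, here is a sketch of the corrected loading code (written for Python 3, while the question's traceback shows Python 2.7; the in-memory sample stands in for train1.txt, whose comma-separated format is only guessed from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def load_dataset(lines):
    """Read one document per line, keeping each document as a single string."""
    return [line.strip().replace(",", " ") for line in lines]

# In the question this would be:
#     with open('train1.txt') as f:
#         dataset = load_dataset(f)
# An in-memory sample is used here so the sketch is self-contained.
sample = ["hello,world\n", "foo,bar,baz\n"]
dataset = load_dataset(sample)
print(dataset)  # ['hello world', 'foo bar baz']

tfidf = TfidfVectorizer()
tfidf.fit(dataset)  # now succeeds: dataset is a list of str
for word, index in sorted(tfidf.vocabulary_.items()):
    print(word, index)
```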