Python error when using CountVectorizer and TfidfTransformer


I wrote the following code to transform some data:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
def transform(data):
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None)
    clean = vectorizer.fit_transform(data)
    clean_tfidf_transformer = TfidfTransformer()
    clean_tfidf = clean_tfidf_transformer.fit_transform(clean)
    return clean_tfidf, clean_tfidf.shape[1]
However, when run on certain data, it produces the following error:

Traceback (most recent call last):
  File "...", line 169, in <module>
    X, y = process(directory, filename)
  File "...", line 132, in process
    tr_abstract, abstractN = transform(pre_abstract)
  File "...", line 77, in transform
    clean = vectorizer.fit_transform(data)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

What does this mean?

Your data has missing values; the following code reproduces the error:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np

vectorizer = CountVectorizer(analyzer = "word", tokenizer=None, preprocessor=None, stop_words=None)
clean = vectorizer.fit_transform([u'i am shane', np.nan])
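If the missing entries carry no information, one possible fix (a sketch, assuming the documents live in a plain list) is to filter out anything that is not a string before vectorizing:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

data = [u'i am shane', np.nan, u'another document']

# Keep only real strings; this drops np.nan (which is a float)
clean_data = [doc for doc in data if isinstance(doc, str)]

vectorizer = CountVectorizer(analyzer="word")
clean = vectorizer.fit_transform(clean_data)
print(clean.shape)  # (2, 4): two documents, four vocabulary terms
```

Note that the default token pattern ignores single-character tokens, so "i" does not appear in the vocabulary.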


I ran into the same error when calling tfidf.fit_transform. None of the other answers here worked for me, so I ran:

df['data'] = df['data'].astype(str) 

and then it worked! Give this a try.

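A minimal sketch of why that cast helps, assuming a pandas DataFrame with a hypothetical 'data' column: astype(str) converts NaN into the literal string "nan", which the vectorizer can tokenize (be aware that "nan" then becomes a vocabulary term):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical DataFrame with a missing value in the text column
df = pd.DataFrame({"data": ["i am shane", np.nan]})

# astype(str) turns NaN into the string "nan",
# so fit_transform no longer raises ValueError
df["data"] = df["data"].astype(str)

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(df["data"])
print(matrix.shape)  # (2, 3): vocabulary is "am", "shane", "nan"
```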

You should use TfidfVectorizer, which the documentation describes as equivalent to CountVectorizer followed by TfidfTransformer, instead of using the two separately.