Python: np.nan is an invalid document, expected byte or unicode string in CountVectorizer



I am trying to create dummy columns for each non-numeric attribute and then drop the non-numeric attributes from the UCI Adult dataset. I am using CountVectorizer from the sklearn.feature_extraction.text lib, but my program tells me that np.nan is an invalid document and that a byte or unicode string was expected.

I just want to know why I am getting this error. Can anyone help me? Thanks.

Here is my code:

import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

def check(ex):
    try:
        int(ex)
        return False
    except ValueError:
        return True

feature_cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Target']

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, names = feature_cols)

feature_cols.remove('Target')
X = data[feature_cols]
y = data['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

columns = X.columns

vect = CountVectorizer()

for each in columns:
    if check(X[each][1]):
        temp = X[each]
        X_dtm = pd.DataFrame(vect.fit_transform(temp).toarray(), columns = vect.get_feature_names())
        X = pd.merge(X, X_dtm, how='outer')
        X = X.drop(each, 1)

print X.columns
The error looks like this:

Traceback (most recent call last):
  File "/home/amey/prog/pd.py", line 41, in <module>
    X_dtm = pd.DataFrame(vect.fit_transform(temp).toarray(), columns = vect.get_feature_names())
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc)), stop_words)
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string


[Finished in 3.3s with exit code 1]
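The NaN that trips the vectorizer most likely enters via the pd.merge(X, X_dtm, how='outer') step, which can leave unmatched rows filled with NaN; CountVectorizer rejects any document that is not a string. A minimal sketch reproducing the error and guarding against it (the Series values below are made-up stand-ins for Adult-dataset entries):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# An object column containing a missing value, as an outer merge can produce:
temp = pd.Series(['Private', np.nan, 'State-gov'])

try:
    vect.fit_transform(temp)   # raises: np.nan is an invalid document
except ValueError as exc:
    print(exc)

# Dropping (or filling) the missing values first avoids the error:
X_dtm = vect.fit_transform(temp.dropna())
print(X_dtm.shape)             # one row per remaining document
```

Alternatively, temp.fillna('') keeps the row count intact by encoding missing entries as all-zero rows.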

Comments: "Please provide the full stack trace." / "I have added the stack trace too, plz refer to this answer:"
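For what it's worth, the dummy columns the question is after can also be built without CountVectorizer at all. A minimal sketch using pandas' own get_dummies (the tiny frame below is a made-up stand-in for the Adult data):

```python
import pandas as pd

# Made-up stand-in rows; the real Adult columns are analogous.
df = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private'],
})

# get_dummies expands each listed column into 0/1 indicator columns
# and leaves the numeric columns untouched, so no merge (and hence
# no NaN) is ever involved.
encoded = pd.get_dummies(df, columns=['workclass'])
print(sorted(encoded.columns))
```

This sidesteps both the merge and the vectorizer, which is usually the simpler route for categorical attributes.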