Python: np.nan is an invalid document, expected byte or unicode string
Tags: python, machine-learning, scikit-learn, text-classification

I have been trying to do text classification. There are two columns, Action and Category. I have split the dataset into train and test splits, and I get: np.nan is an invalid document, expected byte or unicode string.
import numpy as np
import pandas as pd
data1 = pd.read_excel("Kumar_doc_prady.xlsx")
Category1=data1['Category'].unique()
data1.head(10)
Out[138]:
Action Category
0 1.Excel based macro would be designed which w... Automation
1 Add a checkpoint in the Audit checklist to ch... Checklist
2 An excel based macro would be created which w... Automation
3 Add a checkpoint in the Audit checklist to ch... Checklist
4 Update the existing automation to delete the u... Checklist
5 Add checkpoints in the existing Audit checklis... Checklist
6 Implement a Peer Audit checklist to verify tha... Checklist
7 Checklist audits would be introduced for sele... CHecklist
8 Add a checkpoint in the Audit checklist to che... Checklist
9 Create an Automation to extract SKU related da... Checklist
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1["labels1"] = label.fit_transform(data1["Category"])
#data1["Category1"] = label.fit_transform(data1["Category1"])
data1[["Category", "labels1"]].head()
Out[114]:
Category labels1
0 Automation 3
1 Checklist 6
2 Automation 3
3 Checklist 6
4 Checklist 6
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(data1['Action'], data1['labels1'],
random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b',
lowercase=True, stop_words='english')
X_train1_cv = cv.fit_transform(X_train1)
I get an error at the last line above:
Traceback (most recent call last):
File "<ipython-input-142-b8096b8dc028>", line 1, in <module>
X_train1_cv = cv.fit_transform(X_train1)
File "C:\Users\bcpuser\anaconda3\lib\site-
packages\sklearn\feature_extraction\text.py", line 1220, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\bcpuser\anaconda3\lib\site-
packages\sklearn\feature_extraction\text.py", line 1131, in _count_vocab
for feature in analyze(doc):
File "C:\Users\bcpuser\anaconda3\lib\site-
packages\sklearn\feature_extraction\text.py", line 98, in _analyze
doc = decoder(doc)
File "C:\Users\bcpuser\anaconda3\lib\site-
packages\sklearn\feature_extraction\text.py", line 218, in decode
raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode
string.
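Before changing any types, it is worth confirming where the missing documents are. A minimal sketch with synthetic stand-in data (the column names follow the question; the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's DataFrame; 'Action' contains one NaN,
# which is exactly the kind of value CountVectorizer rejects.
data1 = pd.DataFrame({'Action': ["Excel based macro", np.nan, "Add a checkpoint"],
                      'Category': ["Automation", "Checklist", "Checklist"]})

# Count the missing documents and show the offending rows.
print(data1['Action'].isna().sum())
print(data1[data1['Action'].isna()])
```

Running this against the real file would show which rows of `Action` are NaN and therefore trigger the ValueError.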
This seems to be some kind of object dtype problem. Converting the type from object to unicode works:

X_train1_cv = cv.fit_transform(X_train1.values.astype('U'))
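One caveat with this fix: `astype('U')` coerces every value to a unicode string, so NaN becomes the literal string `'nan'` and ends up in the vocabulary. A minimal sketch with synthetic data (column contents invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic stand-in for the 'Action' column, including one missing value.
actions = pd.Series(["Excel based macro would be designed",
                     np.nan,
                     "Add a checkpoint in the Audit checklist"])

cv = CountVectorizer(lowercase=True, stop_words='english')

# astype('U') turns NaN into the string 'nan', so fit_transform no longer
# raises -- but 'nan' is now a vocabulary term.
X = cv.fit_transform(actions.values.astype('U'))
print(sorted(cv.vocabulary_))
```

So the cast silences the error, but dropping or filling the missing rows first is usually cleaner than letting a spurious `'nan'` token into the features.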
data1['Action'] is your full dataset (i.e. not only the train part), while y_train1 is your train-only labels, so the difference in sample counts is not surprising. You should try to resolve whatever error you get in X_train1_cv = cv.fit_transform(X_train1) using X_train1, rather than by reverting to the whole dataset (thereby undoing the train-test split above). Edit the question to focus on that error.
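Following that advice, one way to keep X and y aligned is to drop the rows with missing text before splitting. A minimal sketch with synthetic data (the DataFrame contents are invented; column names follow the question):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic stand-in for the question's DataFrame, with one missing document.
data1 = pd.DataFrame({
    'Action': ["Excel based macro", np.nan, "Add a checkpoint", "Peer Audit checklist"],
    'labels1': [0, 1, 1, 1],
})

# Drop rows whose text is missing BEFORE splitting, so features and labels
# stay aligned row for row.
clean = data1.dropna(subset=['Action'])

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    clean['Action'], clean['labels1'], random_state=1)

cv = CountVectorizer(lowercase=True, stop_words='english')
X_train1_cv = cv.fit_transform(X_train1)      # no NaN left, so this succeeds
print(X_train1_cv.shape[0] == len(y_train1))  # row counts match
```

This fixes the error inside the train/test split instead of undoing it, which is what the answer above recommends.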