Python: np.nan is an invalid document, expected byte or unicode string


I have been trying to do text classification. There are two columns, Action and Category. I have split the dataset into train and test splits, but I get "np.nan is an invalid document, expected byte or unicode string":

    import numpy as np
    import pandas as pd

    data1 = pd.read_excel("Kumar_doc_prady.xlsx")
    Category1=data1['Category'].unique()

    data1.head(10)
    Out[138]: 
                                                  Action    Category  
    0  1.​Excel based macro would be designed which w...  Automation        
    1  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    2  ​An excel based macro would be created which w...  Automation        
    3  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    4  Update the existing automation to delete the u...   Checklist       
    5  Add checkpoints in the existing Audit checklis...   Checklist        
    6  Implement a Peer Audit checklist to verify tha...   Checklist        
    7  ​Checklist audits would be introduced for sele...   CHecklist        
    8  Add a checkpoint in the Audit checklist to che...   Checklist        
    9  Create an Automation to extract SKU related da...   Checklist        


    from sklearn.preprocessing import LabelEncoder
    label = LabelEncoder()
    data1["labels1"] = label.fit_transform(data1["Category"])
    #data1["Category1"] = label.fit_transform(data1["Category1"])
    data1[["Category", "labels1"]].head()
    Out[114]: 
         Category  labels1
    0  Automation        3
    1   Checklist        6
    2  Automation        3
    3   Checklist        6
    4   Checklist        6



    from sklearn.model_selection import train_test_split
    X_train1, X_test1, y_train1, y_test1 = train_test_split(data1['Action'], data1['labels1'], 
    random_state=1)



    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', 
              lowercase=True, stop_words='english')
    X_train1_cv = cv.fit_transform(X_train1)  
I get an error on the last line above:

    Traceback (most recent call last):

      File "<ipython-input-142-b8096b8dc028>", line 1, in <module>
        X_train1_cv = cv.fit_transform(X_train1)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1220, in fit_transform
        self.fixed_vocabulary_)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1131, in _count_vocab
        for feature in analyze(doc):

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 98, in _analyze
        doc = decoder(doc)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 218, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "

    ValueError: np.nan is an invalid document, expected byte or unicode string.

This seems to be some kind of object-dtype error.
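The usual cause is missing values in the text column: pandas reads empty cells as np.nan, so the Series has object dtype and some entries are floats rather than strings, which is exactly what CountVectorizer rejects. A quick diagnostic sketch, using a small hypothetical frame in place of data1:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for data1: one Action cell is missing.
df = pd.DataFrame({
    "Action": ["Excel based macro", np.nan, "Audit checklist update"],
    "Category": ["Automation", "Checklist", "Checklist"],
})

print(df["Action"].dtype)         # object, not a string dtype
print(df["Action"].isna().sum())  # number of offending rows
print(df[df["Action"].isna()])    # inspect them before deciding drop vs. fill
```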

Use

    X_train1_cv = cv.fit_transform(X_train1.values.astype('U'))

to convert the values from object to unicode.
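One caveat with this fix: astype('U') stringifies missing values, so every NaN becomes the literal token 'nan' in the vocabulary. A minimal sketch (the DataFrame is hypothetical) showing the effect, and why dropping the missing rows first is often the safer option:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame standing in for data1: one Action cell is missing.
df = pd.DataFrame({
    "Action": ["Excel based macro", np.nan, "Audit checklist update"],
    "Category": ["Automation", "Checklist", "Checklist"],
})

# astype('U') silently turns NaN into the string 'nan' ...
as_unicode = df["Action"].values.astype("U")

# ... so removing the missing documents first is usually cleaner:
clean = df.dropna(subset=["Action"])
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(clean["Action"])  # only the 2 real documents remain
```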

data1['Action'] is your full dataset (i.e. not just the train part), while y_train1 contains your train-only labels, so the difference in sample counts is not surprising. You should try to resolve whatever error occurs in X_train1_cv = cv.fit_transform(X_train1) while still using X_train1, rather than by reverting to the whole dataset (and thereby undoing the train-test split above). Edit the question to focus on that error.
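For reference, a minimal sketch of that flow on hypothetical data: drop the missing documents once, split, then fit the vectorizer on the train split only and reuse it on the test split, so every matrix lines up with the labels from the same split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame standing in for data1.
df = pd.DataFrame({
    "Action": ["excel macro audit", "checklist update",
               "peer audit checklist", "automation extract data"],
    "labels1": [0, 1, 1, 0],
}).dropna(subset=["Action"])  # remove missing documents before splitting

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    df["Action"], df["labels1"], random_state=1)

cv = CountVectorizer(stop_words="english")
X_train1_cv = cv.fit_transform(X_train1)  # learn the vocabulary on train only
X_test1_cv = cv.transform(X_test1)        # reuse that vocabulary on test

# Row counts now match the label vectors from the same split.
assert X_train1_cv.shape[0] == len(y_train1)
assert X_test1_cv.shape[0] == len(y_test1)
```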