Python: np.nan is an invalid document, expected byte or unicode string


I have been trying to do text classification. There are two columns, Action and Category. I have split the dataset into train and test splits, but I get "np.nan is an invalid document, expected byte or unicode string":

    import numpy as np
    import pandas as pd

    data1 = pd.read_excel("Kumar_doc_prady.xlsx")
    Category1=data1['Category'].unique()

    data1.head(10)
    Out[138]: 
                                                  Action    Category  
    0  1.​Excel based macro would be designed which w...  Automation        
    1  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    2  ​An excel based macro would be created which w...  Automation        
    3  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    4  Update the existing automation to delete the u...   Checklist       
    5  Add checkpoints in the existing Audit checklis...   Checklist        
    6  Implement a Peer Audit checklist to verify tha...   Checklist        
    7  ​Checklist audits would be introduced for sele...   CHecklist        
    8  Add a checkpoint in the Audit checklist to che...   Checklist        
    9  Create an Automation to extract SKU related da...   Checklist        


    from sklearn.preprocessing import LabelEncoder
    label = LabelEncoder()
    data1["labels1"] = label.fit_transform(data1["Category"])
    #data1["Category1"] = label.fit_transform(data1["Category1"])
    data1[["Category", "labels1"]].head()
    Out[114]: 
         Category  labels1
    0  Automation        3
    1   Checklist        6
    2  Automation        3
    3   Checklist        6
    4   Checklist        6



    from sklearn.model_selection import train_test_split
    X_train1, X_test1, y_train1, y_test1 = train_test_split(data1['Action'], data1['labels1'], 
    random_state=1)



    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', 
              lowercase=True, stop_words='english')
    X_train1_cv = cv.fit_transform(X_train1)  
I get an error on the last line above:

    Traceback (most recent call last):

      File "<ipython-input-142-b8096b8dc028>", line 1, in <module>
        X_train1_cv = cv.fit_transform(X_train1)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1220, in fit_transform
        self.fixed_vocabulary_)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1131, in _count_vocab
        for feature in analyze(doc):

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 98, in _analyze
        doc = decoder(doc)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 218, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "

    ValueError: np.nan is an invalid document, expected byte or unicode string.

This seems to be some kind of object-dtype error.
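The usual cause is missing values in the text column: pandas reads empty cells as np.nan, so the Series has object dtype and some entries are floats rather than strings, which is exactly what CountVectorizer rejects. A quick diagnostic sketch, using a small hypothetical frame in place of data1:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for data1: one Action cell is missing.
df = pd.DataFrame({
    "Action": ["Excel based macro", np.nan, "Audit checklist update"],
    "Category": ["Automation", "Checklist", "Checklist"],
})

print(df["Action"].dtype)         # object, not a string dtype
print(df["Action"].isna().sum())  # number of offending rows
print(df[df["Action"].isna()])    # inspect them before deciding drop vs. fill
```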

Use

    X_train1_cv = cv.fit_transform(X_train1.values.astype('U'))

to convert the values from object to unicode.
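One caveat with this fix: astype('U') stringifies missing values, so every NaN becomes the literal token 'nan' in the vocabulary. A minimal sketch (the DataFrame is hypothetical) showing the effect, and why dropping the missing rows first is often the safer option:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame standing in for data1: one Action cell is missing.
df = pd.DataFrame({
    "Action": ["Excel based macro", np.nan, "Audit checklist update"],
    "Category": ["Automation", "Checklist", "Checklist"],
})

# astype('U') silently turns NaN into the string 'nan' ...
as_unicode = df["Action"].values.astype("U")

# ... so removing the missing documents first is usually cleaner:
clean = df.dropna(subset=["Action"])
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(clean["Action"])  # only the 2 real documents remain
```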

data1['Action'] is your full dataset (i.e. not just the train part), while y_train1 contains your train-only labels, so the difference in sample counts is not surprising. You should try to resolve whatever error occurs in X_train1_cv = cv.fit_transform(X_train1) while still using X_train1, rather than by reverting to the whole dataset (and thereby undoing the train-test split above). Edit the question to focus on that error.
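For reference, a minimal sketch of that flow on hypothetical data: drop the missing documents once, split, then fit the vectorizer on the train split only and reuse it on the test split, so every matrix lines up with the labels from the same split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame standing in for data1.
df = pd.DataFrame({
    "Action": ["excel macro audit", "checklist update",
               "peer audit checklist", "automation extract data"],
    "labels1": [0, 1, 1, 0],
}).dropna(subset=["Action"])  # remove missing documents before splitting

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    df["Action"], df["labels1"], random_state=1)

cv = CountVectorizer(stop_words="english")
X_train1_cv = cv.fit_transform(X_train1)  # learn the vocabulary on train only
X_test1_cv = cv.transform(X_test1)        # reuse that vocabulary on test

# Row counts now match the label vectors from the same split.
assert X_train1_cv.shape[0] == len(y_train1)
assert X_test1_cv.shape[0] == len(y_test1)
```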