scikit-learn: multi-label text classification from a single-label dataset


I have a dataset in which each document has one label, as in the example below:

  label           text
  pay             "i will pay now"
  finance         "are you the finance guy?"
  law             "lawyers and law"
  court           "was at the court today"
  finance report  "bank reported annual share.."
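
A text document can be tagged with more than one label, so how can I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for any help.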

This is what I have so far:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import preprocessing

loc = r'C:\Users\..\Downloads\excel.xlsx'

df = pd.read_excel(loc)
X = np.array(df.docs)
z = np.array(df.title)
y = np.array(df.raw)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
Y_test = mlb.fit_transform(y_test)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

doc_new = np.array(['X has announced that it will sell $587 million'])

print("Accuracy Score: ", accuracy_score(Y_test, predicted))
print(mlb.inverse_transform(classifier.predict(doc_new)))
But I keep getting a dimension error:

.format(len(self.classes),yt.shape[1])
ValueError: Expected indicator for 44 classes, but got 46
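
A likely cause, assuming the code runs as posted: `fit_transform` is called a second time on the test labels, so the binarizer forgets the label set learned from the training labels and learns a different one. The classifier still predicts one column per training class, and `inverse_transform` rejects the mismatch. The minimal sketch below reproduces this with hypothetical toy labels (the lists and the stand-in prediction array are assumptions, not the original data) and shows the usual remedy of fitting once and calling `transform` on the test split.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels; each document carries a list of labels.
y_train = [["pay"], ["finance", "report"], ["law"], ["court"]]
y_test = [["finance"], ["law", "court"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)       # learns 5 classes
Y_test_refit = mlb.fit_transform(y_test)   # refits: only 3 classes remain
print(Y_train.shape, Y_test_refit.shape)

# Stand-in for a prediction from a model trained on Y_train (5 columns).
stand_in_prediction = np.zeros((1, Y_train.shape[1]), dtype=int)
try:
    mlb.inverse_transform(stand_in_prediction)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

# Remedy: fit on the training labels once, then only transform the test labels.
mlb_fixed = MultiLabelBinarizer()
Y_train_fixed = mlb_fixed.fit_transform(y_train)
Y_test_fixed = mlb_fixed.transform(y_test)
print(Y_train_fixed.shape, Y_test_fixed.shape)
```

With a single fitted binarizer, the training matrix, the test matrix, and the predictions all share the same columns, so `accuracy_score` and `inverse_transform` compare like with like.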


I found a solution. I used pandas groupby:

df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()
to combine each text with all of its classes, and it works well.
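
As a rough illustration of why the grouping helps, here is a small sketch on hypothetical toy rows (the column names `id`, `doc`, and `label` follow the line above; the data itself is made up). `MultiLabelBinarizer` expects one iterable of labels per document. A bare string is itself an iterable of characters, so ungrouped string labels get split into letters, whereas the grouped lists give one column per real label.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: one row per (document, label) pair.
rows = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "doc": [
        "i will pay now",
        "lawyers and law",
        "bank reported annual share",
        "bank reported annual share",
    ],
    "label": ["pay", "law", "finance", "report"],
})

# Collapse repeated documents so each one carries a list of its labels.
grouped = rows.groupby(["id", "doc"])["label"].apply(list).reset_index()

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(grouped["label"])
print(list(mlb.classes_))
print(Y)

# For contrast: plain strings are iterated character by character.
char_classes = list(MultiLabelBinarizer().fit(["pay", "law"]).classes_)
print(char_classes)
```

The resulting `Y` has one row per document and one column per label, which is the shape `OneVsRestClassifier` expects as a multi-label target.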

The dimension error is resolved as well: