Python: Understanding how to use MultiOutputClassifier with multi-label text classification


python scikit-learn


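I am trying to do multi-output, multi-label, multi-class text classification. The example below works, but I know it does not use MultiOutputClassifier properly. I believe the point is that I should only have to train once, with a single call to fit, even when there are several outputs. How can I achieve this with one pass over the data?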

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier

X_train = np.array(["new york is a really big city",
                    "new york was originally dutch",
                    "the big apple is huge",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "edinburgh is a small city in great britain",
                    "a northern city in great britain is edinburgh",
                    "edinburgh is in the uk",
                    "edinburgh is in england",
                    "edinburgh is in great britain",
                    "edinburgh is not big",
                    "edinburgh hosts the holyrood palace and new york hosts the empire state building",
                    "nyc is big and edinburgh is smaller",
                    "i like edinburgh better than new york"])
y_train_text_1 = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["edinburgh"],["edinburgh"],["edinburgh"],["edinburgh"],
                ["edinburgh"],["edinburgh"],["new york","edinburgh"],["edinburgh"],["new york","edinburgh"]]
y_train_text_2 = [["big"],[""],["big"],["big"],[""],
                [""],["small"],[""],[""],[""],
                [""],["big"],[""],["big","small"],[""]]

X_test = np.array(['nice day in nyc',
                   'my big day in edinburgh',
                   'edinburgh is small but nyc is big',
                   'it is raining in britain',
                   'it is raining in britain and the big apple',
                   'it is raining in britain and nyc',
                   'hello welcome to new york. enjoy it here and edinburgh too'])

mlb_1 = MultiLabelBinarizer()
Y_1 = mlb_1.fit_transform(y_train_text_1)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(OneVsRestClassifier(LinearSVC())))])

classifier.fit(X_train, Y_1)
predicted = classifier.predict(X_test)
all_labels = mlb_1.inverse_transform(predicted)

print('city name classifier:')
for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

# Now fit on second output - can I do it all at once?
mlb_2 = MultiLabelBinarizer()
Y_2 = mlb_2.fit_transform(y_train_text_2)
classifier.fit(X_train, Y_2)
predicted = classifier.predict(X_test)
all_labels = mlb_2.inverse_transform(predicted)

print('city size classifier:')
for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

Output from running it:

city name classifier:
nice day in nyc => new york
my big day in edinburgh => edinburgh
edinburgh is small but nyc is big => edinburgh
it is raining in britain => edinburgh
it is raining in britain and the big apple => edinburgh
it is raining in britain and nyc => edinburgh
hello welcome to new york. enjoy it here and edinburgh too => edinburgh, new york
city size classifier:
nice day in nyc => 
my big day in edinburgh => 
edinburgh is small but nyc is big => big, small
it is raining in britain => 
it is raining in britain and the big apple => big
it is raining in britain and nyc => 
hello welcome to new york. enjoy it here and edinburgh too =>
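
A side note on the label format, as a minimal sketch rather than a definitive diagnosis: the second label list in the question marks "no label" with [""], and MultiLabelBinarizer treats the empty string as a class of its own. A line that looks empty in the printed output may therefore be a predicted "" label rather than an empty label set. The tiny lists below are made up for illustration; they are not the question's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical three-sample label lists, for illustration only.
y_size_blank = [["big"], [""], ["big", "small"]]  # "" as in the question
y_size_empty = [["big"], [], ["big", "small"]]    # empty list instead

mlb_blank = MultiLabelBinarizer()
Y_blank = mlb_blank.fit_transform(y_size_blank)

mlb_empty = MultiLabelBinarizer()
Y_empty = mlb_empty.fit_transform(y_size_empty)

# With "", the binarizer learns three classes, one of them the empty string.
print(list(mlb_blank.classes_), Y_blank.shape)

# With [], only the real labels become columns, and the unlabeled
# sample is an all-zero row of the 2D indicator matrix.
print(list(mlb_empty.classes_), Y_empty.shape)
print(Y_empty)
```

Either way the target is a 2D indicator matrix of shape (n_samples, n_labels); the only difference is whether "no label" gets a column of its own.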

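I don't think sklearn supports training multi-output multi-label in a single pass, because the documentation says "Multioutput-multiclass classification ... the output format is a 2d numpy array or sparse matrix." If it were supported, the output format would have to be n-d (for example, with two targets like yours, it would have to be 3d).

@YohanesGultom Yes, you are right. Scikit-learn currently cannot handle a mix of multi-output and multi-label for multi-class problems. Its …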
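
Given the 2D constraint described in the comments, one possible workaround is to place the two binarized label matrices side by side and fit a single model on the combined indicator matrix. The sketch below reuses the question's training data, with two changes that are my own assumptions: empty lists replace the "" placeholders, and MultiOutputClassifier wraps LinearSVC directly, so each label column gets its own independent binary classifier. The names y_city, y_size, mlb_city, mlb_size and n_city are illustrative. This is not a native multi-output multi-label mode; it only reduces the training to one call to fit.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a really big city",
           "new york was originally dutch",
           "the big apple is huge",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "edinburgh is a small city in great britain",
           "a northern city in great britain is edinburgh",
           "edinburgh is in the uk",
           "edinburgh is in england",
           "edinburgh is in great britain",
           "edinburgh is not big",
           "edinburgh hosts the holyrood palace and new york hosts the empire state building",
           "nyc is big and edinburgh is smaller",
           "i like edinburgh better than new york"]
y_city = [["new york"], ["new york"], ["new york"], ["new york"], ["new york"],
          ["new york"], ["edinburgh"], ["edinburgh"], ["edinburgh"], ["edinburgh"],
          ["edinburgh"], ["edinburgh"], ["new york", "edinburgh"], ["edinburgh"],
          ["new york", "edinburgh"]]
# Assumption: [] stands in for the question's [""] "no label" marker.
y_size = [["big"], [], ["big"], ["big"], [],
          [], ["small"], [], [], [],
          [], ["big"], [], ["big", "small"], []]

X_test = ["nice day in nyc",
          "my big day in edinburgh",
          "edinburgh is small but nyc is big",
          "it is raining in britain and the big apple"]

# Binarize each output separately, then stack the columns into one 2D target.
mlb_city = MultiLabelBinarizer()
mlb_size = MultiLabelBinarizer()
Y_city = mlb_city.fit_transform(y_city)
Y_size = mlb_size.fit_transform(y_size)
Y = np.hstack([Y_city, Y_size])

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(LinearSVC())),
])
classifier.fit(X_train, Y)  # a single fit covering both outputs

# Split the predicted columns back into the two label spaces.
predicted = classifier.predict(X_test)
n_city = Y_city.shape[1]
city_labels = mlb_city.inverse_transform(predicted[:, :n_city])
size_labels = mlb_size.inverse_transform(predicted[:, n_city:])

for item, cities, sizes in zip(X_test, city_labels, size_labels):
    print("{0} => city: {1} | size: {2}".format(
        item, ", ".join(cities), ", ".join(sizes)))
```

The text features are computed once and shared, but the per-label classifiers are still trained independently, so this should behave much like the two separate fits in the question; it does not model any dependence between the city and size outputs.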