Python: classifier sometimes does not assign a category to a document

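I am currently trying to classify some documents into a fixed number of categories. The main problem is that the classifier sometimes fails to find a suitable category, so the output for that document is empty.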

I am using the following code:

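from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Turn the lists of label names into a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert the sparse tf-idf matrix to a dense array, which GaussianNB requires."""

    def fit(self, X, y=None, **fit_params):
        # Stateless: there is nothing to learn.
        return self

    def transform(self, X, y=None):
        # toarray() returns an ndarray; todense() returns np.matrix,
        # which recent scikit-learn versions reject.
        return X.toarray()

classifier = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
# Map each indicator row back to label names; an all-zero row becomes an empty tuple.
all_labels = mlb.inverse_transform(predicted)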

An example of the output:

doc 0 : ""
doc 1 : "news"
doc 2 : "spam"
doc 3 : ""
doc 4 : ""
doc 5 : "news"
doc 6 : "tech-news"
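
Isn't the principle to assign each document a category based on a similarity comparison? (tf-idf represents how frequent the words are in a document.)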
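
A minimal sketch of how an empty output can arise in this setup. OneVsRestClassifier fits one binary classifier per label, and with an indicator-matrix target each of them answers yes or no independently, so nothing forces at least one label per document. The indicator rows below are made up for illustration (they are not real predictions); the label names are taken from the example output above.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Fit the binarizer on the three label names seen in the example output.
mlb = MultiLabelBinarizer()
mlb.fit([["news"], ["spam"], ["tech-news"]])

# Hypothetical prediction matrix: one row per document, one column per label.
# Columns follow mlb.classes_, i.e. the sorted label names.
indicator = np.array([
    [0, 0, 0],  # every per-label classifier said "no"
    [1, 0, 0],  # only "news"
    [0, 1, 1],  # "spam" and "tech-news"
])

labels = mlb.inverse_transform(indicator)
print(labels)  # [(), ('news',), ('spam', 'tech-news')]
```

The first row maps to an empty tuple, which is what shows up as an empty category in the output.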

Edit: sample code

import numpy as np

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york",
                    "I love fruits mate",
                    "I usually eat apples",
                    "we should go for bananas or other fruits"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"],["Fruits"],["Fruits"],["Fruits"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too',
                   'how about fruits like apples or something today ?',
                   'shall we go for apples ?'])
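target_names = ['New York', 'London', 'Fruits']

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Binarize the label lists; columns follow mlb.classes_ (the sorted label names).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

# One linear SVM per label; each decides independently whether its label applies.
classifier = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel="linear", decision_function_shape='ovo')))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)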

This sample code gives the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => 
it is raining in britian and the big apple => new york
it is raining in britian and nyc => new york
hello welcome to new york. enjoy it here and london too => london, new york
how about fruits like apples or something today ? => Fruits
shall we go for apples ? => Fruits
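
One way to look into the empty row is to inspect the per-label decision scores rather than only the final predictions. The sketch below uses a reduced version of the sample data and a hypothetical helper, `predict_with_fallback` (not part of scikit-learn), which assigns the highest-scoring label whenever no label scores above zero. It assumes the final estimator exposes `decision_function` and that zero is the decision threshold, as is the case for SVC; whether forcing a label is desirable depends on the application.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Reduced version of the sample data from the question.
train_docs = ["new york is a hell of a town",
              "nyc is nice",
              "new york is also called the big apple",
              "london is in the uk",
              "it rains a lot in london",
              "london hosts the british museum",
              "new york is great and so is london",
              "I usually eat apples",
              "we should go for bananas or other fruits"]
train_labels = [["new york"], ["new york"], ["new york"],
                ["london"], ["london"], ["london"],
                ["new york", "london"], ["Fruits"], ["Fruits"]]
test_docs = ["nice day in nyc",
             "it is raining in britian",
             "shall we go for apples ?"]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", OneVsRestClassifier(SVC(kernel="linear"))),
])
model.fit(train_docs, Y)


def predict_with_fallback(model, docs):
    """Hypothetical helper: like predict, but never returns an all-zero row."""
    # One column of scores per label; a label is predicted when its score is > 0.
    scores = model.decision_function(docs)
    indicator = (scores > 0).astype(int)
    # Rows where no label passed the threshold get the best-scoring label instead.
    rows = np.flatnonzero(indicator.sum(axis=1) == 0)
    indicator[rows, scores[rows].argmax(axis=1)] = 1
    return indicator


print(model.decision_function(test_docs))  # raw per-label scores
indicator = predict_with_fallback(model, test_docs)
labels = mlb.inverse_transform(indicator)
for doc, doc_labels in zip(test_docs, labels):
    print(doc, "=>", ", ".join(doc_labels))
```

A document whose words never appeared in training (such as the misspelled "britian") is turned into an all-zero feature vector, so each per-label score reduces to that classifier's intercept; if all of those are negative, plain `predict` returns no label for it.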

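Comments:

Double-check that the values in y_train_text are not all empty strings.

Why are you using both TfidfVectorizer and TfidfTransformer? TfidfVectorizer uses TfidfTransformer internally when needed. I don't see how this helps you. Are you following a tutorial? If so, please link to it. And post some samples so that we can reproduce the problem.

@aberger: done, they are all correct values. I adapted this code from a Stack Overflow post. I kept modifying the vectorizer to get better results and forgot to take out the transformer (since it is useless here). Thanks for pointing that out.

@VivekKumar: I answered in my last comment ^^ (I forgot to mention you).

Can you show some sample data (X, Y) to reproduce this?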
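
To illustrate the point about the vectorizers: with default settings, TfidfVectorizer should produce the same matrix as CountVectorizer followed by TfidfTransformer, so chaining a TfidfTransformer after a TfidfVectorizer is redundant. A small check, using a few made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Illustrative documents borrowed from the sample data.
docs = ["new york is a hell of a town",
        "london is in the uk",
        "we should go for bananas or other fruits"]

# One step: raw text straight to tf-idf weights.
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Two steps: raw counts first, then tf-idf weighting.
two_step = make_pipeline(
    CountVectorizer(stop_words="english"),
    TfidfTransformer(),
).fit_transform(docs)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)  # True
```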