Machine learning: Why is the following partial fit not working properly?

machine-learning scikit-learn

Hello, I have the following list of comments:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I vectorize the comments with tfidf as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing

I use a label encoder to encode the labels:

le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

Here I use passive aggressive to fit the model:

clf2 = PassiveAggressiveClassifier()


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I try to test partial fit with three new comments and their corresponding labels:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that I do not get the right results after the partial fit:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

Instead, I get this output:

[2 2 2]

So I would really appreciate help finding out why the model is not updated when I test it with the same examples it has just learned from. The desired output is:

[1,0,2]

I would also appreciate help adjusting the hyperparameters, if that is what it takes to get the desired output.

This is the complete code that shows the partial fit; it prints [2 2 2] rather than [1,0,2]:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)



with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

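Well, there are several problems with your code. I will go from the obvious ones to the more complex ones.

First, you are pickling clf2 before it has learned anything: you pickle it as soon as it is defined, which serves no purpose. If you are only testing, that is fine. Otherwise, it should be pickled after fit or an equivalent call.

Second, you call clf2.fit before clf2.partial_fit. This defeats the whole purpose of partial_fit. When you call fit, you essentially fix the class labels that the model will know about. In your case that is acceptable, because your subsequent call to partial_fit passes the same labels. But it is still not good practice.

In a partial_fit scenario, never call fit. Always call partial_fit, both with your starting data and with newly arriving data. But make sure that the first call to partial_fit receives every label you want the model to learn, through the classes parameter.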

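Now the last part, about your tfidf_vectorizer. You call fit_transform (which is essentially fit followed by transform) on tfidf_vectorizer with the comments array. That means that on subsequent calls to transform, as in transform(new_comments), it will not learn new words from new_comments. It only uses the words it saw during the call to fit, that is, the words present in comments.

The same goes for LabelEncoder and sents.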
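To make the frozen-vocabulary point concrete, here is a minimal sketch (not part of the original answer) that fits the vectorizer on comments only and then transforms new_comments. It assumes TfidfVectorizer's default tokenizer, which drops one-letter tokens such as "I". The names unseen and nonzero_per_row are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# Fit on the first batch only, as in the question.
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf_vectorizer.fit(comments)

# Words that appear only in new_comments never enter the vocabulary.
vocabulary = set(tfidf_vectorizer.vocabulary_)
unseen = sorted({'love', 'life', 'hate', 'you', 'important'} - vocabulary)
print(unseen)  # ['hate', 'important', 'life', 'love', 'you']

# transform() silently ignores unseen words, so a comment made up entirely
# of unseen words becomes an all-zero row.
vec_new_comments = tfidf_vectorizer.transform(new_comments)
nonzero_per_row = vec_new_comments.getnnz(axis=1).tolist()
print(nonzero_per_row)  # [0, 0, 3]
```

The first two new comments end up as identical all-zero vectors, so the classifier has no way to give them different labels, whatever the hyperparameters.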

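This, again, is not preferable in an online learning scenario. You should fit on all available data at once. But since you are trying to use partial_fit, we assume you have a very large dataset that may not fit in memory at once. So you would want some kind of partial fit for TfidfVectorizer as well. But TfidfVectorizer does not support partial_fit. In fact, it is not designed for large data. So you need to change your approach. See the related questions on this topic for more details.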
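One commonly used alternative, which the answer itself does not name, is scikit-learn's HashingVectorizer. It is stateless, so there is nothing to fit and each batch can be transformed on its own. The sketch below works under that assumption. The n_features value and the two-batch split are arbitrary choices for illustration, and the features are normalized hashed term counts rather than TF-IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: no vocabulary is learned, so no fit step is needed.
# n_features=2**18 is an arbitrary illustrative size.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Two batches of (texts, encoded labels), standing in for data that arrives
# over time. Labels: 0 = angry, 1 = happy, 2 = indiferent.
batches = [
    (['I am very agry', 'this is not interesting', 'I am very happy'], [0, 2, 1]),
    (['I love the life', 'I hate you', 'this is not important'], [1, 0, 2]),
]
all_classes = np.array([0, 1, 2])

clf = PassiveAggressiveClassifier(random_state=0)
for texts, labels in batches:
    X = hasher.transform(texts)  # works on any batch, seen or unseen words
    clf.partial_fit(X, labels, classes=all_classes)

X_new = hasher.transform(['I hate you'])
print(X_new.shape)
print(clf.predict(X_new))
```

Because the hashing is fixed, words from later batches still get non-zero features, unlike with a TfidfVectorizer fitted on the first batch only.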

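All that aside, if you change only the tfidf part so that it is fit on the whole data (comments and new_comments together), you will get the desired results.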

See the code changes below (I have reorganized it a bit and renamed vec_new_comments to new_tfidf, so please read it carefully):

from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The below lines are important

# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)

# same for `sents`, but since the labels dont change, it doesnt matter which you use, because it will be same
# le.fit(sents)
le.fit(sents + new_sents)

Below is the not-so-preferred code (the one you are using, which I discussed in the second point above), but the results are fine as long as you make the changes above:

clf2 = PassiveAggressiveClassifier()

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]     As you wanted

The correct approach, the way partial_fit is intended to be used, is to skip fit entirely and call partial_fit for the initial data too, passing every label through the classes parameter on the first call.
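The intended use of partial_fit described above (no call to fit, all labels declared up front) can be sketched as follows. This is a reconstruction from the points in this answer, not the answerer's exact code. The names all_classes and restored are illustrative, and random_state=0 is only there for repeatability. The classifier is pickled only after it has learned something.

```python
import pickle

from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
sents = ['angry', 'indiferent', 'happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']
new_sents = ['happy', 'angry', 'indiferent']

# Fit the vectorizer and the label encoder on everything that is available.
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf_vectorizer.fit(comments + new_comments)
le = preprocessing.LabelEncoder()
le.fit(sents + new_sents)

# Every label the model should ever see, as encoded integers.
all_classes = le.transform(le.classes_)

clf2 = PassiveAggressiveClassifier(random_state=0)

# First call: `classes` must be supplied here.
clf2.partial_fit(tfidf_vectorizer.transform(comments),
                 le.transform(sents),
                 classes=all_classes)

# Later calls: labels must come from all_classes; `classes` can be omitted.
new_tfidf = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(new_tfidf, le.transform(new_sents))

# Pickle only after the model has been trained.
restored = pickle.loads(pickle.dumps(clf2, pickle.HIGHEST_PROTOCOL))
print(restored.predict(new_tfidf))
```

Each partial_fit call makes a single pass over its batch, so predictions on a tiny dataset like this one may still be imperfect. Calling partial_fit again on the same batches is one way to give the model more updates.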

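How are you fitting clf2? Please post the whole code as a single snippet. Copy-pasting it piece by piece is really annoying.

@VivekKumar I have updated the question and added the complete code to reproduce my issue, thanks for the support.

Thank you very much for the support, I finally got past this problem.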