Machine learning: why is the following partial fit not working properly?


Hello, I have the following list of comments:

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I vectorize the comments with tf-idf as follows:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)

I use a label encoder to encode the labels:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)

Here I set up the passive-aggressive model and pickle the vectorizer, the tf-idf matrix and the classifier:

import pickle
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I try out partial_fit with three new comments and their corresponding labels:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that after this partial fit I do not get the correct result:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
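I would therefore appreciate help understanding why the model is not updated when I test it on the very same examples I just passed to partial_fit. The desired output is:

[1,0,2]

However, I get:

[2 2 2]

I would also appreciate any guidance on hyperparameters that would produce the desired output.

Here is the complete code that reproduces the partial-fit behaviour:
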
import pickle  # needed for the pickle.dump / pickle.load calls below
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)



with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

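Your code has several problems. I'll go from the obvious ones to the more complex ones:

1. You are pickling clf2 before it has learned anything. You pickle it right after defining it, which serves no purpose. If you are only testing, that's fine; otherwise it should be pickled after the call to fit() or an equivalent.
2. You call clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the class labels the model will know about. In your case that is acceptable, because your subsequent call to partial_fit() passes the same labels, but it is still not good practice.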
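To illustrate the first point, here is a minimal sketch, assuming a reasonably recent scikit-learn. It pickles in memory with pickle.dumps/pickle.loads instead of the files used in the question, and the variable names are illustrative only.

```python
import pickle

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
labels = [0, 2, 1]  # angry, indiferent, happy, as LabelEncoder would encode them

X = TfidfVectorizer(analyzer='word').fit_transform(comments)

# Pickling right after construction stores only the hyperparameters,
# so the restored object still cannot predict anything.
unfitted = pickle.loads(pickle.dumps(PassiveAggressiveClassifier()))
try:
    unfitted.predict(X)
    unfitted_can_predict = True
except NotFittedError:
    unfitted_can_predict = False
print('unfitted model can predict:', unfitted_can_predict)

# Pickling after fit() stores the learned weights as well.
fitted = PassiveAggressiveClassifier().fit(X, labels)
restored = pickle.loads(pickle.dumps(fitted, pickle.HIGHEST_PROTOCOL))
print('restored coef_ shape:', restored.coef_.shape)
```

The restored classifier carries one weight vector per class over the 8 vocabulary words, which is what makes pickling after fitting worthwhile.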

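When you use partial_fit(), never call fit(). Always call partial_fit() for both the starting data and the new data. But make sure that the first call to partial_fit() supplies all the labels you want the model to learn, via the classes parameter.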
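The role of the classes parameter can be seen in a small sketch. The toy arrays below are made up for illustration and are not the data from the question.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy data (hypothetical): the first batch only contains classes 0 and 1,
# class 2 shows up later.
X_first = np.array([[1.0, 0.0], [0.0, 1.0]])
y_first = np.array([0, 1])
X_later = np.array([[1.0, 1.0]])
y_later = np.array([2])
all_classes = np.array([0, 1, 2])

# Without `classes`, the very first partial_fit() call is rejected.
first_call_failed = False
try:
    PassiveAggressiveClassifier().partial_fit(X_first, y_first)
except ValueError as exc:
    first_call_failed = True
    print('first call without classes failed:', exc)

# With `classes` on the first call, later batches may introduce
# any of the declared labels.
clf = PassiveAggressiveClassifier()
clf.partial_fit(X_first, y_first, classes=all_classes)
clf.partial_fit(X_later, y_later)
print(clf.classes_)
```

Declaring every label up front lets the model allocate a weight vector for class 2 before it has seen a single example of it.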

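Now the last part, about tfidf_vectorizer. You call fit_transform() (which is essentially fit() followed by transform()) on tfidf_vectorizer with the comments array. This means that subsequent calls to transform(), as in your transform(new_comments), will not learn any new words from new_comments. They will only use the words seen during the call to fit(), that is, the words present in comments.

The same applies to LabelEncoder and sents.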
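A short sketch, using the data from the question, makes the effect visible: with the default tokenizer, two of the three new comments share no word with the fitted vocabulary and therefore become all-zero vectors. The "sad" label at the end is a made-up example of an unseen label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

vec = TfidfVectorizer(analyzer='word')
vec.fit(comments)
vocabulary = sorted(vec.vocabulary_)
print(vocabulary)

# transform() only looks up words that are already in the vocabulary.
new_tfidf = vec.transform(new_comments)
nonzeros_per_row = (new_tfidf.toarray() != 0).sum(axis=1)
print(nonzeros_per_row)  # [0 0 3]

# LabelEncoder behaves analogously: labels not seen in fit() are rejected.
le = LabelEncoder().fit(['angry', 'indiferent', 'happy'])
unseen_label_rejected = False
try:
    le.transform(['sad'])  # 'sad' is a hypothetical unseen label
except ValueError:
    unseen_label_rejected = True
print('unseen label rejected:', unseen_label_rejected)
```

A classifier cannot tell two all-zero rows apart, so no amount of updating on these vectors can yield different predictions for the first two new comments.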

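This, too, is undesirable in an online-learning scenario. You should fit on all the available data at once. But since you are trying to use partial_fit(), we assume you have a very large dataset that may not fit in memory at once, so you would also want some kind of partial fit for TfidfVectorizer. However, TfidfVectorizer does not support partial_fit(); in fact, it is not designed for large data. So you need to change your approach. A related question covers this in more detail.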
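One possible change of approach, offered here as an assumption and not as what the answer necessarily had in mind, is scikit-learn's stateless HashingVectorizer. It needs no fit() and therefore has no vocabulary to outgrow, at the cost of dropping the IDF weighting. The batches list and the n_features value are illustrative choices.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

# Two small batches standing in for chunks of a dataset too large for memory.
batches = [
    (['I am very agry', 'this is not interesting', 'I am very happy'],
     ['angry', 'indiferent', 'happy']),
    (['I love the life', 'I hate you', 'this is not important'],
     ['happy', 'angry', 'indiferent']),
]

# The full label set must be known up front.
le = LabelEncoder().fit(['angry', 'indiferent', 'happy'])
all_classes = le.transform(le.classes_)

# Hashing maps each token straight to a column index; nothing is learned.
hasher = HashingVectorizer(analyzer='word', n_features=2 ** 12)
clf = PassiveAggressiveClassifier(random_state=0)

for texts, sentiments in batches:
    X = hasher.transform(texts)
    y = le.transform(sentiments)
    clf.partial_fit(X, y, classes=all_classes)

predicted = le.inverse_transform(clf.predict(hasher.transform(batches[1][0])))
print(predicted)
```

Because every batch is hashed into the same fixed feature space, words that first appear in a later batch still receive their own columns.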

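All that aside, if you change only the tf-idf part so that it is fitted on the whole data (comments and new_comments together), you will get the desired result.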

Look at the code changes below (I have reorganized the code a little and renamed vec_new_comments to new_tfidf, so please read it carefully):

from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The below lines are important

# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)

# same for `sents`, but since the labels dont change, it doesnt matter which you use, because it will be same
# le.fit(sents)
le.fit(sents + new_sents)

Below is the not-so-preferred code (the one you are using, which I discussed in point 2), but the results are fine as long as you make the changes above:

clf2 = PassiveAggressiveClassifier()

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]     As you wanted

The correct approach, the way partial_fit() is intended to be used, is to call partial_fit() for both the starting data and the new data, passing every class label through the classes parameter on the first call.
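The partial_fit()-only usage recommended in this answer can be sketched as follows. This is a reconstruction under the answer's stated advice and not the answerer's own code; it reuses the question's data and fits the vectorizer and the label encoder on all of it. No expected output is shown, since the predictions depend on the classifier's shuffling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
sents = ['angry', 'indiferent', 'happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']
new_sents = ['happy', 'angry', 'indiferent']

# Fit the vectorizer and the label encoder on everything that is available.
tfidf_vectorizer = TfidfVectorizer(analyzer='word').fit(comments + new_comments)
le = LabelEncoder().fit(sents + new_sents)
all_classes = le.transform(le.classes_)

clf2 = PassiveAggressiveClassifier()

# First call: starting data, with the complete list of classes.
tfidf = tfidf_vectorizer.transform(comments)
clf2.partial_fit(tfidf, le.transform(sents), classes=all_classes)

# Later calls: new data only; fit() is never called.
new_tfidf = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(new_tfidf, le.transform(new_sents))

predictions = clf2.predict(new_tfidf)
print(le.inverse_transform(predictions))
```

Since partial_fit() performs a single pass over each batch, several passes over the data may be needed before the predictions settle.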

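How are you fitting clf2? Please post the whole code as a single snippet; having to copy and paste it piece by piece is very annoying.

@VivekKumar I have updated the question and added the complete code to reproduce my problem. Thanks for the support.

Thank you very much for the support, I finally got past this problem.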