Python: Why is sklearn cross_val_score so low?

python machine-learning scikit-learn

OK, I'm trying to get cross-validation scores for 4 different algorithms here. My dataframe looks like this:

target  type  post
1       intj  "hello world shdjd"
2       entp  "hello world fddf"
16      estj  "hello world dsd"
4       esfp  "hello world sfs"
1       intj  "hello world ddfd"

where type contains repeated values. This is how I compute the cross-validation scores:

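from sklearn import model_selection, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Encode the string labels in `type` as integers
encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(result['type'])

# 70/30 hold-out split (not used by the cross-validation loop below)
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    result['post'], y_encoded, test_size=0.30, random_state=1)

models = {'lr': LogisticRegression(multi_class='multinomial', solver='newton-cg'),
          'nb': MultinomialNB(alpha=0.0001),
          'sgd': SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42,
                               max_iter=5, tol=None),
          'rf': RandomForestClassifier(n_estimators=10)}

# Score each classifier with 10-fold cross-validation on the full dataset
for name, clf in models.items():
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', clf)])

    res = cross_val_score(pipe, result.post, result.target, cv=10, n_jobs=8)
    print(name, res.mean(), res.std())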

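This runs, but the means are all around 0.3. The actual accuracy is about 0.98 for all of them, except logistic regression, whose actual accuracy is about 0.7.

What is going wrong here?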

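Edit: here is how I know the accuracy of each algorithm is higher than 0.3 (I do this for each algorithm; the logistic regression version is the code block at the end of this post):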

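In the models in your for loop, you measure how the model performs on the cross-validation partitions. In your manual edit, you measure how it performs on docs_test. Usually, you expect your CV score to be similar to your performance on an out-of-sample test set. If you perform considerably better on the test set, then docs_test was probably not created randomly. You may have target leakage. Maybe the model just happens to predict that test set well.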
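To illustrate how large the gap can get when the rows being scored were also used for fitting (one form of leakage), here is a library-free sketch. The memorizing toy model, the synthetic posts, and the random labels are all assumptions made up for illustration; they are not the asker's data and not scikit-learn's API.

```python
import random
from collections import Counter

random.seed(0)

TYPES = [f"type{i}" for i in range(16)]          # 16 hypothetical classes
posts = [f"post {i}" for i in range(2000)]       # unique synthetic texts
labels = [random.choice(TYPES) for _ in posts]   # labels unrelated to the text


class Memorizer:
    """Toy model: remembers every training row, else guesses the majority class."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.fallback = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.fallback) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_val_accuracy(X, y, k=10):
    """Plain k-fold CV: each fold is scored by a model that never saw it."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        X_tr = [x for i, x in enumerate(X) if i not in test_idx]
        y_tr = [t for i, t in enumerate(y) if i not in test_idx]
        X_te = [x for i, x in enumerate(X) if i in test_idx]
        y_te = [t for i, t in enumerate(y) if i in test_idx]
        scores.append(accuracy(y_te, Memorizer().fit(X_tr, y_tr).predict(X_te)))
    return sum(scores) / k


# Fit on everything, then score on the very same rows
train_acc = accuracy(labels, Memorizer().fit(posts, labels).predict(posts))
cv_acc = cross_val_accuracy(posts, labels)
print(f"scored on the rows used for fitting: {train_acc:.2f}")
print(f"10-fold cross-validation:            {cv_acc:.2f}")
```

The first number is perfect only because the model has already seen every row it is asked about; the cross-validated number is the one that reflects performance on unseen posts. If docs_test contains rows that were also passed to fit, the manual scores are inflated in the same way.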

What do you mean by "actual accuracy"? You are printing the mean accuracy scores of 4 modeling pipelines.

@thomaskolasa Please see my edit.

How many rows do you have? Your edit doesn't use 10-fold CV, so it has 10 times more examples to learn from.

@thomaskolasa I have 2000. Honestly, I'm new to all of this. Should I change the number of folds here?

Sorry, my comment above was incorrect. Each partition of 10-fold CV trains on 90% of the data.

OK. Given these accuracies, SVM (0.97), naive Bayes (0.95), random forest (0.98), and logistic regression (0.7), what should a normal cross-validation score look like?

In the for loop, how does pipe.predict(docs_test) perform on docs_test? That should give you the same results as when you predict on docs_test manually.
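The 90% figure from the comments can be checked with simple arithmetic for the 2000 rows mentioned above. A minimal sketch; kfold_sizes is a hypothetical helper written for this illustration and is not part of scikit-learn:

```python
def kfold_sizes(n_rows, k=10):
    """Return (train_size, validation_size) for each of k folds."""
    base, extra = divmod(n_rows, k)
    sizes = []
    for fold in range(k):
        # The first `extra` folds get one additional validation row
        n_val = base + (1 if fold < extra else 0)
        sizes.append((n_rows - n_val, n_val))
    return sizes


sizes = kfold_sizes(2000, k=10)
for fold, (n_train, n_val) in enumerate(sizes):
    print(f"fold {fold}: train on {n_train} rows, validate on {n_val} rows")
```

Each CV model is therefore fit on 1800 rows, so a model fit on all 2000 rows sees only about 11% more data. That difference alone is unlikely to explain a jump from roughly 0.3 to 0.98.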

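import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.pipeline import Pipeline

# From the question's edit: manual accuracy check for logistic regression
text_clf3 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(multi_class='multinomial', solver='newton-cg')),
])

# The pipeline is fit on every row of `result`
text_clf3.fit(result.post, result.target)

# `docs_test` is not defined in the post; if it overlaps `result.post`,
# the scores below are training-set scores, not out-of-sample scores
predicted3 = text_clf3.predict(docs_test)
print("Logistic Regression: ")
print(np.mean(predicted3 == result.target))
print(metrics.classification_report(result.target, predicted3))

print(confusion_matrix(result.target, predicted3))
print("LR Precision:", precision_score(result.target, predicted3, average='weighted'))
print("LR Recall:", recall_score(result.target, predicted3, average='weighted'))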