Python scikit-learn: computing precision and recall with the cross_val_score function


I'm using scikit-learn to run a logistic regression on spam/ham data. X_train is my training data and y_train holds the labels ('spam' or 'ham'), and I train my logistic regression like this:

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracy of a 10-fold cross-validation, I just write:

 accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was also possible to compute precision and recall by simply adding a scoring parameter, like this:

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4') 
Is this related to the data (should I binarize the labels?), or did they change the cross_val_score function?

Thanks in advance.

The syntax shown above is correct. It looks like the problem is with the data you are using. The labels do not need to be binarized, as long as they are not continuous numbers.

You can verify the same syntax on a different dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_train = iris['data']
y_train = iris['target']

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Iris has 3 classes; on current scikit-learn the plain 'precision'/'recall'
# scorers only accept binary targets, so use an averaged variant here.
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision_macro'))
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall_macro'))

To calculate recall and precision on your data, it must indeed be binarized, like this:

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
# Fit the binarizer on the labels and replace them with their 0/1 encoding
y_train = lb.fit_transform(y_train)
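To make the binarization step concrete, here is a minimal sketch with made-up spam/ham labels (not the poster's data) showing what LabelBinarizer actually produces:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels standing in for y_train
y_train = np.array(['ham', 'spam', 'ham', 'spam', 'ham'])

lb = LabelBinarizer()
# fit_transform returns a column vector; ravel() flattens it to a 0/1 array
y_bin = lb.fit_transform(y_train).ravel()

print(lb.classes_)  # classes are sorted, so 'spam' maps to 1 (the positive class)
print(y_bin)        # [0 1 0 1 0]
```

After this, scoring='precision' and scoring='recall' know which class is positive (pos_label=1), so the original ValueError disappears.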
Going one step further, I was surprised that I did not have to binarize the data when I wanted to calculate the accuracy:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

This is simply because the accuracy formula does not need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). Indeed, TP and TN are interchangeable in it, which is not the case for recall, precision and f1.
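That symmetry is easy to check numerically: swapping which class counts as positive leaves accuracy unchanged but changes precision. A small sanity check on toy labels (not the poster's data):

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]

# Accuracy is identical after flipping every label (positive <-> negative)
acc = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score([1 - y for y in y_true], [1 - y for y in y_pred])
assert acc == acc_swapped

# Precision, by contrast, depends on which class is the positive one
p1 = precision_score(y_true, y_pred, pos_label=1)  # TP=1, FP=1 -> 0.5
p0 = precision_score(y_true, y_pred, pos_label=0)  # TP=2, FP=1 -> 2/3
print(acc, p1, p0)
```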

I ran into the same problem here, and I solved it with

# precision, recall and F1
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])

recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recall), recall)
precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precision), precision)
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1', np.mean(f1), f1)
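An alternative that avoids re-labelling y_train at all is to tell the scorer directly which string label is the positive class via make_scorer. A sketch with toy data standing in for the poster's X_train / y_train:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_val_score

# Toy features and string labels in place of the real spam/ham data
X = np.arange(8).reshape(-1, 1)
y = np.array(['spam', 'ham'] * 4)

# A scorer that treats the string 'spam' as the positive class
precision_spam = make_scorer(precision_score, pos_label='spam')

# DummyClassifier is only used here so the example is self-contained;
# any classifier (e.g. LogisticRegression) works the same way
clf = DummyClassifier(strategy='constant', constant='spam')
scores = cross_val_score(clf, X, y, cv=2, scoring=precision_spam)
print(scores)  # each fold predicts all-'spam', so precision is 0.5 per fold
```

The same pattern works for recall_score and f1_score, so the string labels never have to be converted to 0/1.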

You can use cross-validation like this to get the f1-score and recall:

from time import time
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed

print('10-fold cross validation:\n')
start_time = time()
scores = cross_val_score(clf, X, y, cv=10, scoring='f1')
recall_scores = cross_val_score(clf, X, y, cv=10, scoring='recall')
print("%s f1: %0.2f (+/- %0.2f)" % ('DecisionTreeClassifier', scores.mean(), scores.std()))
print("--- Classifier %s took %s seconds ---" % ('DecisionTreeClassifier', time() - start_time))
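On modern scikit-learn you can also collect several metrics in a single pass with cross_validate instead of calling cross_val_score once per metric. A sketch with toy binary data in place of the poster's X, y:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Random binary data standing in for the real dataset
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

clf = DecisionTreeClassifier(random_state=0)
# One cross-validation run, several scorers: results come back
# under the keys 'test_f1' and 'test_recall'
results = cross_validate(clf, X, y, cv=10, scoring=['f1', 'recall'])
print(results['test_f1'].mean(), results['test_recall'].mean())
```

This fits each fold only once for all requested metrics, which matters when the classifier is expensive to train.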
For more scoring parameters, see the scikit-learn documentation on the scoring parameter.