Python scikit-learn: output metrics.classification_report into CSV/tab-delimited format
I'm doing a multiclass text classification in scikit-learn. The dataset is being trained using a Multinomial Naive Bayes classifier having hundreds of labels. Here's an extract from the scikit-learn script for fitting the MNB model:
from __future__ import print_function
# Read file.csv into a pandas DataFrame
import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, on_bad_lines='skip', low_memory=False)  # error_bad_lines=False on pandas < 1.3
# define X and y using the original DataFrame
X = merged.text
y = merged.grid
# split X and y into training and testing sets
# (train_test_split lives in sklearn.model_selection since scikit-learn 0.18)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)
# make class predictions
y_pred_class = nb.predict(X_test_dtm)
# generate classification report
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_class))
A simplified output of metrics.classification_report on the command-line screen looks like this:
             precision    recall  f1-score   support
         12       0.84      0.48      0.61      2843
         13       0.00      0.00      0.00        69
         15       1.00      0.19      0.32       232
         16       0.75      0.02      0.05       965
         33       1.00      0.04      0.07       155
          4       0.59      0.34      0.43      5600
         41       0.63      0.49      0.55      6218
         42       0.00      0.00      0.00       102
         49       0.00      0.00      0.00        11
          5       0.90      0.06      0.12      2010
         50       0.00      0.00      0.00         5
         51       0.96      0.07      0.13      1267
         58       1.00      0.01      0.02       180
         59       0.37      0.80      0.51      8127
          7       0.91      0.05      0.10       579
          8       0.50      0.56      0.53      7555
  avg/total       0.59      0.48      0.45     35919
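I was wondering if there is any way to get the report output into a standard CSV file with regular column headers.
When I send the command-line output into a CSV file, or try to copy/paste the screen output into a spreadsheet (OpenOffice Calc or Excel), it lumps the results into one column. It looks like this: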
precision recall f1-score support
12 0.84 0.48 0.61 2843
13 0.00 0.00 0.00 69
15 1.00 0.19 0.32 232
16 0.75 0.02 0.05 965
33 1.00 0.04 0.07 155
4 0.59 0.34 0.43 5600
41 0.63 0.49 0.55 6218
42 0.00 0.00 0.00 102
49 0.00 0.00 0.00 11
5 0.90 0.06 0.12 2010
50 0.00 0.00 0.00 5
51 0.96 0.07 0.13 1267
58 1.00 0.01 0.02 180
59 0.37 0.80 0.51 8127
7 0.91 0.05 0.10 579
8 0.50 0.56 0.53 7555
avg/total 0.59 0.48 0.45 35919
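A likely reason for the single-column result is that classification_report returns one plain string that is padded with spaces for display, so a spreadsheet finds no comma or tab to split on. The following is a minimal sketch of that observation; the toy labels are invented for illustration and are not the asker's data.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

report = classification_report(y_true, y_pred)

# The report is a single str, aligned with spaces rather than delimiters.
report_is_text = isinstance(report, str)
has_delimiters = ("," in report) or ("\t" in report)

# Passing that str straight to the DataFrame constructor fails,
# because a bare string carries no tabular structure for pandas to use.
try:
    pd.DataFrame(report)
    constructor_failed = False
except ValueError:
    constructor_failed = True

print(report_is_text, has_delimiters, constructor_failed)
```

This is why the answers below either parse the text or go back to the underlying numbers.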
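The way I solved the output issue is as I mentioned in my earlier comment: I converted my output to a DataFrame. Not only is it very easy to send to a file with to_csv(), it is also very easy to manipulate the data structure. Another way I have solved this is by writing the output line by line with writerow from the csv module.
If you manage to get the output into a DataFrame, it would be
dataframe_name_here.to_csv()
or, if you use csv, it would be something like the examples provided in the csv documentation.
If you want the individual scores, this should do the job just fine: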
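import pandas as pd
from sklearn.metrics import classification_report

def classification_report_csv(report):
    report_data = []
    lines = report.split('\n')
    # Skip the header and the blank line below it, and stop before the summary
    # rows; this slice matches the older layout ending in one "avg / total" row.
    for line in lines[2:-3]:
        row_data = line.split()  # split on any run of whitespace
        if len(row_data) != 5:
            continue  # blank line, or a class label that contains spaces
        row = {}
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    dataframe = pd.DataFrame(report_data)
    dataframe.to_csv('classification_report.csv', index=False)

report = classification_report(y_true, y_pred)
classification_report_csv(report)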
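The writerow route mentioned in the first answer can be sketched without parsing any text, by taking the numbers from precision_recall_fscore_support. This is only an illustration: the toy labels and the file name report_rows.csv are placeholders, not part of the original answers.

```python
import csv
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
labels = [0, 1, 2]

# One array per metric, each ordered like `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels)

# "report_rows.csv" is a placeholder file name.
with open("report_rows.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["class", "precision", "recall", "f1-score", "support"])
    for label, p, r, f, s in zip(labels, precision, recall, f1, support):
        writer.writerow([label, round(float(p), 2), round(float(r), 2),
                         round(float(f), 2), int(s)])
```

Passing delimiter="\t" to csv.writer should give a tab-delimited file instead.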
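We can get the actual values from the precision_recall_fscore_support function and then put them into a DataFrame.
The code below gives the same per-class result, but now in a pandas DataFrame: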
import pandas as pd
from sklearn import metrics

clf_rep = metrics.precision_recall_fscore_support(true, pred)
out_dict = {
    "precision": clf_rep[0].round(2),
    "recall": clf_rep[1].round(2),
    "f1-score": clf_rep[2].round(2),
    "support": clf_rep[3],
}
out_df = pd.DataFrame(out_dict, index=nb.classes_)
# Note: this is a plain mean over classes, whereas the report's avg/total row is support-weighted
avg_tot = (out_df.apply(lambda x: round(x.mean(), 2) if x.name != "support" else round(x.sum(), 2)).to_frame().T)
avg_tot.index = ["avg/total"]
out_df = pd.concat([out_df, avg_tot])  # DataFrame.append was removed in pandas 2.0
print(out_df)
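One detail worth checking in the snippet above is the summary row: a plain mean over classes is a macro average, while the "avg / total" row of the older text report is weighted by support. A small sketch of the difference, using invented toy labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy, deliberately imbalanced labels (invented for illustration).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

# Unweighted mean over classes (what x.mean() computes).
macro_precision = precision.mean()

# Mean weighted by the number of true samples per class.
weighted_precision = np.average(precision, weights=support)

# scikit-learn's own weighted average, for comparison.
reference = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")[0]

print(macro_precision, weighted_precision, reference)
```

With balanced classes the two averages coincide, which may be why the difference is easy to miss.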
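precision_recall_fscore_support returns NumPy arrays (one each for precision, recall, F-score and support), which can be converted to a pandas DataFrame or simply saved as a CSV file.
While the previous answers probably all work, I found them a bit verbose. The following stores the individual class results as well as the summary line in a single DataFrame. It is fragile with respect to changes in the report format, but it did the trick for me.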
# init snippet and fake data
from io import StringIO
import re
import pandas as pd
from sklearn import metrics

true_label = [1, 1, 2, 2, 3, 3]
pred_label = [1, 2, 2, 3, 3, 1]

def report_to_df(report):
    # Collapse runs of spaces so that a single space can act as the separator.
    # Relies on the older layout whose only summary row is "avg / total".
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)
    return report_df

# txt report to df
report = metrics.classification_report(true_label, pred_label)
report_df = report_to_df(report)

# store, print, copy...
print(report_df)
This gives the desired output:
Classes    precision  recall  f1-score  support
1                0.5     0.5       0.5        2
2                0.5     0.5       0.5        2
3                0.5     0.5       0.5        2
avg/total        0.5     0.5       0.5        6
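Another option is to calculate the underlying data and compose the report on your own. You get all the statistics from precision_recall_fscore_support.
Here is my code for a 2-class (pos, neg) classification: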
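import csv
from sklearn import metrics

# classes holds the two labels in the order [pos, neg]
report = metrics.precision_recall_fscore_support(true_labels, predicted_labels, labels=classes)
rowDictionary = {}
rowDictionary["precision_pos"] = report[0][0]
rowDictionary["recall_pos"] = report[1][0]
rowDictionary["f1-score_pos"] = report[2][0]
rowDictionary["support_pos"] = report[3][0]
rowDictionary["precision_neg"] = report[0][1]
rowDictionary["recall_neg"] = report[1][1]
rowDictionary["f1-score_neg"] = report[2][1]
rowDictionary["support_neg"] = report[3][1]
fieldnames = list(rowDictionary)
with open('report.csv', 'w', newline='') as file:  # 'report.csv' is a placeholder name
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(rowDictionary)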
As mentioned in one of the posts here, precision_recall_fscore_support is analogous to classification_report.
It then suffices to use pandas to format the data in columns, similar to what classification_report does. Here is an example:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
np.random.seed(0)
y_true = np.array([0]*400 + [1]*600)
y_pred = np.random.randint(2, size=1000)
def pandas_classification_report(y_true, y_pred):
    metrics_summary = precision_recall_fscore_support(
        y_true=y_true,
        y_pred=y_pred)

    avg = list(precision_recall_fscore_support(
        y_true=y_true,
        y_pred=y_pred,
        average='weighted'))

    metrics_sum_index = ['precision', 'recall', 'f1-score', 'support']
    class_report_df = pd.DataFrame(
        list(metrics_summary),
        index=metrics_sum_index)

    support = class_report_df.loc['support']
    total = support.sum()
    avg[-1] = total

    class_report_df['avg / total'] = avg

    return class_report_df.T
With classification_report you will get something like this:
print(classification_report(y_true=y_true, y_pred=y_pred, digits=6))
Output:
             precision    recall  f1-score   support
          0   0.379032  0.470000  0.419643       400
          1   0.579365  0.486667  0.528986       600
avg / total   0.499232  0.480000  0.485248      1000
Then with our custom function pandas_classification_report:
df_class_report = pandas_classification_report(y_true=y_true, y_pred=y_pred)
print(df_class_report)
Output:
             precision    recall  f1-score  support
0             0.379032  0.470000  0.419643    400.0
1             0.579365  0.486667  0.528986    600.0
avg / total   0.499232  0.480000  0.485248   1000.0
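Then just save it to CSV format with to_csv (other separators are possible too, for example sep=';').
I opened my_csv_file.csv with LibreOffice Calc, although any tabular/spreadsheet editor such as Excel would do.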
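The save step described here might look like the sketch below. To keep it self-contained, df_class_report is rebuilt by hand from the numbers in the output shown above rather than computed; my_tsv_file.tsv is a placeholder name introduced for the tab-separated variant.

```python
import pandas as pd

# Stand-in for the DataFrame returned by pandas_classification_report;
# the values are copied from the output shown above.
df_class_report = pd.DataFrame(
    {"precision": [0.379032, 0.579365, 0.499232],
     "recall": [0.470000, 0.486667, 0.480000],
     "f1-score": [0.419643, 0.528986, 0.485248],
     "support": [400.0, 600.0, 1000.0]},
    index=[0, 1, "avg / total"])

# Comma-separated, keeping the class labels as the first column.
df_class_report.to_csv("my_csv_file.csv", sep=",")

# Tab-separated variant; "my_tsv_file.tsv" is a placeholder name.
df_class_report.to_csv("my_tsv_file.tsv", sep="\t")

# Reading the file back shows that the columns survive the round trip.
round_trip = pd.read_csv("my_csv_file.csv", index_col=0)
print(round_trip)
```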
I also found some of the answers a bit verbose. Here is my three-line solution, using precision_recall_fscore_support as others have suggested:
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
report = pd.DataFrame(list(precision_recall_fscore_support(y_true, y_pred)),
                      index=['Precision', 'Recall', 'F1-score', 'Support']).T
# Now add the 'Avg/Total' row
report.loc['Avg/Total', :] = precision_recall_fscore_support(y_true, y_pred,
                                                             average='weighted')
report.loc['Avg/Total', 'Support'] = report['Support'].sum()
Along with example input and output, here is another function, metrics_report_to_df(). Using precision_recall_fscore_support from sklearn.metrics should do:
# Generates classification metrics using precision_recall_fscore_support:
from sklearn import metrics
import pandas as pd
import numpy as np
# Simulating true and predicted labels as test dataset:
np.random.seed(10)
y_true = np.array([0]*300 + [1]*700)
y_pred = np.random.randint(2, size=1000)
# Here's the custom function returning classification report dataframe:
def metrics_report_to_df(ytrue, ypred):
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(ytrue, ypred)
    classification_report = pd.concat(map(pd.DataFrame, [precision, recall, fscore, support]), axis=1)
    classification_report.columns = ["precision", "recall", "f1-score", "support"]
    # Add row with "avg/Total"
    classification_report.loc['avg/Total', :] = metrics.precision_recall_fscore_support(ytrue, ypred, average='weighted')
    classification_report.loc['avg/Total', 'support'] = classification_report['support'].sum()
    return classification_report
# Provide input as true_label and predicted label (from classifier)
classification_report = metrics_report_to_df(y_true, y_pred)
# Here's the output (metrics report transformed to dataframe )
In [1047]: classification_report
Out[1047]:
           precision    recall  f1-score  support
0           0.300578  0.520000  0.380952    300.0
1           0.700624  0.481429  0.570703    700.0
avg/Total   0.580610  0.493000  0.513778   1000.0
As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas DataFrame is to have the report returned as a dict:
report = classification_report(y_test, y_pred, output_dict=True)
and then construct a DataFrame and transpose it:
df = pandas.DataFrame(report).transpose()
From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).
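Put together, the dict-based route could be sketched as follows. The toy labels and both file names are placeholders chosen for illustration, and the sketch assumes a scikit-learn version that supports output_dict.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, invented for illustration only.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

# Nested dict keyed by class label, plus the summary entries.
report = classification_report(y_test, y_pred, output_dict=True)

# Transpose so that each class (and each summary entry) becomes a row.
df = pd.DataFrame(report).transpose()

# Comma-separated and tab-separated files; the file names are placeholders.
df.to_csv("classification_report.csv", index_label="class")
df.to_csv("classification_report.tsv", sep="\t", index_label="class")

print(df)
```

The index_label argument names the first column, so a spreadsheet shows a header above the class labels as well.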
See also the classification_report documentation.
I modified @kindjacket's answer. Try this:
import collections
import pandas as pd

def classification_report_df(report):
    report_data = []
    lines = report.split('\n')
    # Drop the blank separator lines; the positions assume the
    # scikit-learn 0.20 layout with three summary rows at the end.
    del lines[-5]
    del lines[-1]
    del lines[1]
    for line in lines[1:]:
        row = collections.OrderedDict()
        row_data = line.split()
        row_data = list(filter(None, row_data))
        # Assumes every row label has two words, e.g. "class 1" or "macro avg"
        row['class'] = row_data[0] + " " + row_data[1]
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = int(row_data[5])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    df.set_index('class', inplace=True)
    return df
You can then export df to CSV using pandas.
I have written the following code to extract the classification report and save it to an Excel file:
import os
import re
from collections import defaultdict
import pandas as pd

def classifcation_report_processing(model_to_report):
    tmp = list()
    for row in model_to_report.split("\n"):
        # Split on runs of two or more spaces so labels such as "avg / total" stay intact
        parsed_row = [x for x in re.split(r"\s{2,}", row.strip()) if len(x) > 0]
        if len(parsed_row) > 0:
            tmp.append(parsed_row)
    # Store in dictionary
    measures = tmp[0]
    D_class_data = defaultdict(dict)
    for row in tmp[1:]:
        if len(row) != len(measures) + 1:
            continue  # e.g. the shorter "accuracy" row in newer reports
        class_label = row[0]
        for j, m in enumerate(measures):
            D_class_data[class_label][m.strip()] = float(row[j + 1].strip())
    save_report = pd.DataFrame.from_dict(D_class_data).T
    path_to_save = os.getcwd() + '/Classification_report.xlsx'
    save_report.to_excel(path_to_save, index=True)
    return save_report.head(5)
To call the function, the line below can be used anywhere in the program:
saving_CL_report_naive_bayes = classifcation_report_processing(classification_report(y_val, prediction))
The output looks like this:
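I had the same problem. What I did was paste the string output of metrics.classification_report into Google Sheets or Excel and split the text into columns using a custom delimiter of 5 spaces.
Just import pandas as pd and make sure that you set output_dict=True when computing the classification_report. This results in a classification_report dictionary, which you can then pass to the pandas DataFrame constructor. You may want to transpose the resulting DataFrame to fit the output format that you want. The resulting DataFrame can then be written to a CSV file as you wish.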
clsf_report = pd.DataFrame(classification_report(y_true = your_y_true, y_pred = your_y_preds5, output_dict=True)).transpose()
clsf_report.to_csv('Your Classification Report Name.csv', index= True)
It is obviously a better idea to just output the classification report as a dict:
sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)
But here is a function I made that converts the results for all classes (classes only) to a pandas DataFrame:
import pandas as pd

def report_to_df(report):
    report = [x.split(' ') for x in report.split('\n')]
    header = ['Class Name'] + [x for x in report[0] if x != '']
    values = []
    # The last five lines hold the blank separator, the accuracy row,
    # the two average rows and the trailing empty line
    for row in report[1:-5]:
        row = [value for value in row if value != '']
        if row != []:
            values.append(row)
    df = pd.DataFrame(data=values, columns=header)
    return df
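The same classes-only table can probably be reached without parsing text, by filtering the summary entries out of the output_dict result. In this sketch the set of summary keys is an assumption based on the entries recent scikit-learn reports contain; the toy labels are invented.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, invented for illustration only.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "bird", "bird", "cat"]

report = classification_report(y_true, y_pred, output_dict=True)

# Summary entries to drop. This set is an assumption; adjust it to
# whatever keys the installed scikit-learn version actually emits.
summary_keys = {"accuracy", "micro avg", "macro avg", "weighted avg", "samples avg"}

# Keep only the per-class entries, each a dict of four metrics.
class_rows = {name: scores for name, scores in report.items()
              if name not in summary_keys}

# One row per class, one column per metric.
class_df = pd.DataFrame.from_dict(class_rows, orient="index")
class_df.index.name = "Class Name"

print(class_df)
```

Unlike the string-based version, this keeps the metric values numeric and does not depend on how many summary lines the text report prints.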
The simplest and best way I found is:
classes = ['class 1', 'class 2', 'class 3']
report = classification_report(Y[test], Y_pred, target_names=classes)
report_path = "report.txt"
# The context manager closes the file even if the write fails
with open(report_path, "w") as text_file:
    text_file.write(report)
It is definitely worth using:
sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)
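But a slightly revised version of the function above is as follows. The function includes the accuracy, macro avg and weighted avg rows along with the classes: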
import numpy as np
import pandas as pd

def classification_report_to_dataframe(str_representation_of_report):
    split_string = [x.split(' ') for x in str_representation_of_report.split('\n')]
    column_names = [''] + [x for x in split_string[0] if x != '']
    values = []
    for table_row in split_string[1:-1]:
        table_row = [value for value in table_row if value != '']
        if table_row != []:
            values.append(table_row)
    for i in values:
        # Re-join two-word labels such as "macro avg" and "weighted avg"
        if i[1] == 'avg':
            i[0:2] = [' '.join(i[0:2])]
        # The accuracy row has only a score and a support count: pad precision and recall
        if len(i) == 3:
            i.insert(1, np.nan)
            i.insert(2, np.nan)
    report_to_df = pd.DataFrame(data=values, columns=column_names)
    return report_to_df
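Output for a test classification report can be found here.
I was going to try to recreate the results as I was typing this, but have you tried to turn the table into a DataFrame using pandas and then sending the DataFrame to CSV using dataframe_name_here.to_csv()? Could you also show the code in which you write the results to the CSV?
@MattR I have edited the question and provided the full Python code... I was passing the output of the script to a CSV file from the Linux command line thus: $ python3 script.py > result.csv
Thanks, I tried to use the DataFrame: result = metrics.classification_report(y_test, y_pred_class); df = pd.DataFrame(result); df.to_csv(results.csv, sep='\t') but got an error: pandas.core.common.PandasError: DataFrame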