Python: how to efficiently compare the accuracy of all models
Tags: python, pandas, scikit-learn

I have split my training data and initialized 11 classifier models, and now I want to compare them. I am running VS Code on Ubuntu 18.04. I tried:
# Prepare lists
models = [ran, knn, log, xgb, gbc, svc, ext, ada, gnb, gpc, bag]
scores = []

# Sequentially fit and cross validate all models
for mod in models:
    mod.fit(X_train, y_train)
    acc = cross_val_score(mod, X_train, y_train, scoring="accuracy", cv=10)
scores.append(acc.mean())

# Creating a table of results, ranked highest to lowest
results = pd.DataFrame({
    'Model': ['Random Forest', 'K Nearest Neighbour', 'Logistic Regression',
              'XGBoost', 'Gradient Boosting', 'SVC', 'Extra Trees', 'AdaBoost',
              'Gaussian Naive Bayes', 'Gaussian Process', 'Bagging Classifier'],
    'Score': scores})
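The last part returns:

ValueError: arrays must all be same length

I have counted twice, and there really are 11 models. What am I missing?

There seems to be an indentation error in your code; see the edited code below. In your code, len(scores) returns 1, because append is called outside the loop, so only the last value is added.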
# Prepare lists
models = [ran, knn, log, xgb, gbc, svc, ext, ada, gnb, gpc, bag]
scores = []

# Sequentially fit and cross validate all models
for mod in models:
    mod.fit(X_train, y_train)
    acc = cross_val_score(mod, X_train, y_train, scoring="accuracy", cv=10)
    scores.append(acc.mean())  # now inside the loop: one score per model
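One way to make this kind of mismatch harder to hit is to keep each display name next to its estimator, so the names and the scores cannot drift apart. The following is only a minimal sketch under stated assumptions: the data is a synthetic stand-in generated with make_classification (not the asker's X_train and y_train), and three illustrative models replace the original eleven.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data)
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

# Name -> estimator pairs; three illustrative models instead of the original 11
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Build one row per model inside the loop, so every name gets exactly one score
rows = []
for name, mod in models.items():
    acc = cross_val_score(mod, X_train, y_train, scoring="accuracy", cv=10)
    rows.append({"Model": name, "Score": acc.mean()})

# Rank the results from highest to lowest mean accuracy
results = (
    pd.DataFrame(rows)
    .sort_values("Score", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Because each row is built in the same loop iteration that computes its score, the table always has as many rows as there are models.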
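Having upvoted the previous answer, I will go on to demonstrate that the error is indeed due to your scores.append() being outside your for loop.

We don't need to actually fit any models; we can simulate the situation with the following modifications to the code, which do not change the essence of the issue: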
import numpy as np
import pandas as pd

models = ['ran', 'knn', 'log', 'xgb', 'gbc', 'svc', 'ext', 'ada', 'gnb', 'gpc', 'bag']
scores = []
cv = 10

# Sequentially fit and cross validate all models
for mod in models:
    acc = np.array([np.random.rand() for i in range(cv)])  # simulate your accuracy here
scores.append(acc.mean())  # as in your code, i.e. outside the for loop

# Create a dataframe of results
results = pd.DataFrame({
    'Model': ['Random Forest', 'K Nearest Neighbour', 'Logistic Regression', 'XGBoost', 'Gradient Boosting',
              'SVC', 'Extra Trees', 'AdaBoost', 'Gaussian Naive Bayes', 'Gaussian Process', 'Bagging Classifier'],
    'Score': scores})
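Not surprisingly, this reproduces your error:

ValueError: arrays must all be same length

because, as already pointed out in the other answer, your scores list has only one element, namely the acc.mean() from the last iteration of the loop: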
len(scores)
# 1
scores
# [0.47317491043203785]
Hence pandas complains, because it cannot populate an 11-row dataframe from a single score.
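The mismatch can be reproduced in isolation, without any models at all. The sketch below uses three placeholder names and a single made-up score (both are illustrative assumptions) to show that pandas rejects dictionary columns of unequal length, along with a cheap length check that could be run before building the table. The exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

names = ["Random Forest", "K Nearest Neighbour", "Logistic Regression"]
scores = [0.81]  # placeholder: only one value, as when append() runs after the loop

try:
    pd.DataFrame({"Model": names, "Score": scores})
    error_raised = False
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length
    error_raised = True
    print("ValueError:", exc)

# A cheap guard that can be checked before building the table
lengths_match = len(names) == len(scores)
print(lengths_match)
```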
Moving scores.append() inside the for loop, as suggested in the other answer, resolves the issue:
import numpy as np
import pandas as pd

models = ['ran', 'knn', 'log', 'xgb', 'gbc', 'svc', 'ext', 'ada', 'gnb', 'gpc', 'bag']
scores = []
cv = 10

for mod in models:
    acc = np.array([np.random.rand() for i in range(cv)])  # simulate your accuracy here
    scores.append(acc.mean())  # moved inside the loop

# Create a dataframe of results
results = pd.DataFrame({
    'Model': ['Random Forest', 'K Nearest Neighbour', 'Logistic Regression', 'XGBoost', 'Gradient Boosting',
              'SVC', 'Extra Trees', 'AdaBoost', 'Gaussian Naive Bayes', 'Gaussian Process', 'Bagging Classifier'],
    'Score': scores})
print(results)

# output:
                   Model     Score
0          Random Forest  0.492364
1    K Nearest Neighbour  0.624068
2    Logistic Regression  0.613653
3                XGBoost  0.536488
4      Gradient Boosting  0.484195
5                    SVC  0.381556
6            Extra Trees  0.274922
7               AdaBoost  0.509297
8   Gaussian Naive Bayes  0.362866
9       Gaussian Process  0.606538
10    Bagging Classifier  0.393950
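You may also want to keep in mind that you don't need the model.fit() part of your code; cross_val_score does all the necessary fitting itself.

Where exactly does the error occur? Please include the full error trace.
@desertnaut The error is returned by the pandas DataFrame.
Have you checked the answers below (i.e. moving scores.append() in line with the rest of the for loop)?
Welcome to SO; if one of the answers solved your problem, please accept it.
None of the answers good enough to accept?
No, that is not the problem. The error is returned by the last part of the code, the pandas DataFrame.
@StanislavJirak That does not mean the answers are wrong; as your code stands, scores becomes a single-element list (i.e. you append only the last acc.mean() from the for loop), which does produce the error you report. Please include the full error trace.
@StanislavJirak You can check the length of scores in your code, which will be 1. You are trying to create a dataframe with 11 entries in one column and 1 entry in the other, which throws the error.
If the error is thrown, it happens because the columns have incorrect lengths. I see that now. But how do I create an empty array with 11 columns? I tried the code.
What do you mean? Why would you create an empty array? The remedy suggested here solves your problem; see my answer for corroboration.
Nice explanation. @StanislavJirak I hope you understand the error in your code.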
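On the point raised in the second answer that the explicit fit() call is unnecessary: the sketch below, which uses synthetic data and a single LogisticRegression as illustrative assumptions, passes an estimator that was never fitted to cross_val_score. The function fits clones of the estimator on each fold, so the scores are produced without a prior fit() and the original object is left unfitted afterwards.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption, not the asker's dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=1000)  # never fitted explicitly

# cross_val_score fits a fresh clone of clf on each training fold
acc = cross_val_score(clf, X, y, scoring="accuracy", cv=10)
print(acc.shape)  # (10,) one score per fold

# The original estimator has no learned coefficients after cross-validation
fitted_after_cv = hasattr(clf, "coef_")
print(fitted_after_cv)
```

If a fitted model is needed later, for example to predict on a held-out test set, fit() still has to be called separately on the chosen estimator.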