Machine learning: improving model accuracy in sklearn


My decision tree classifier gives an accuracy of 0.52, but I want to improve it. How can I improve the accuracy using any of the classification models available in sklearn?

I have used kNN, a decision tree, and cross-validation, but all of them give low accuracy.

Thanks.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#read from the csv file and return a Pandas DataFrame.
nba = pd.read_csv('wine.csv')

# print the column names
original_headers = list(nba.columns.values)
print(original_headers)

#print the first three rows.
print(nba[0:3])

# "quality" is the class attribute we are predicting. 
class_column = 'quality'

#The dataset contains attributes such as player name and team name. 
#We know that they are not useful for classification and thus do not 
#include them as features. 
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid',
                   'residual sugar', 'chlorides', 'free sulfur dioxide',
                   'total sulfur dioxide', 'density', 'pH', 'sulphates',
                   'alcohol']

#Pandas DataFrame allows you to select columns. 
#We use column selection to split the data into features and class. 
nba_feature = nba[feature_columns]
nba_class = nba[class_column]

print(nba_feature[0:3])
print(list(nba_class[0:3]))

train_feature, test_feature, train_class, test_class = \
train_test_split(nba_feature, nba_class, stratify=nba_class, \
train_size=0.75, test_size=0.25)

training_accuracy = []
test_accuracy = []

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))

train_class_df = pd.DataFrame(train_class,columns=[class_column])     
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)

temp_df = pd.DataFrame(test_class,columns=[class_column])
temp_df['Predicted quality'] = pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))

prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))
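Before switching models, it can also help to tune the tree itself rather than fixing `max_depth=4`. A minimal sketch using `GridSearchCV`; sklearn's built-in `load_wine` dataset stands in here for the question's `wine.csv` (it is a different wine dataset), so with the real file you would keep the `pd.read_csv` loading above:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# load_wine is a stand-in for the question's wine.csv.
X, y = load_wine(return_X_y=True)

# Search over tree depth and leaf size with 10-fold cross-validation.
param_grid = {'max_depth': [2, 3, 4, 6, 8, None],
              'min_samples_leaf': [1, 2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10)
search.fit(X, y)

best_score = search.best_score_
print("Best parameters:", search.best_params_)
print("Best CV score: {:.2f}".format(best_score))
```

`search.best_estimator_` is then a tree refit on all the data with the winning parameters.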

Usually the next step after a decision tree is a random forest (and its relatives), or XGBoost (which is not part of sklearn). Try them. Also note that a single decision tree overfits very easily.
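As a sketch of that suggestion, here is a `RandomForestClassifier`; sklearn's built-in `load_wine` dataset stands in for the question's `wine.csv`, so the split mirrors the one in the question:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load_wine is a stand-in for the question's wine.csv.
X, y = load_wine(return_X_y=True)
train_feature, test_feature, train_class, test_class = train_test_split(
    X, y, stratify=y, train_size=0.75, test_size=0.25, random_state=0)

# An ensemble of trees usually overfits much less than a single tree.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(train_feature, train_class)
rf_accuracy = forest.score(test_feature, test_class)
print("Random forest test accuracy: {:.2f}".format(rf_accuracy))
```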

Remove outliers. Check the classes in your dataset: if they are imbalanced, that is probably where most of the errors come from. In that case you need to use class weights when fitting, or in the metric function (or use F1).
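A minimal sketch of that imbalance check and reweighting, again with `load_wine` standing in for the question's `wine.csv`:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_wine is a stand-in for the question's wine.csv.
X, y = load_wine(return_X_y=True)
train_feature, test_feature, train_class, test_class = train_test_split(
    X, y, stratify=y, train_size=0.75, test_size=0.25, random_state=0)

# Inspect the class distribution first.
classes, counts = np.unique(train_class, return_counts=True)
print("Class counts:", dict(zip(classes, counts)))

# class_weight='balanced' reweights samples inversely to class frequency.
tree = DecisionTreeClassifier(max_depth=4, class_weight='balanced',
                              random_state=0)
tree.fit(train_feature, train_class)
prediction = tree.predict(test_feature)

# Weighted F1 reflects per-class performance, unlike plain accuracy.
weighted_f1 = f1_score(test_class, prediction, average='weighted')
print("Weighted F1: {:.2f}".format(weighted_f1))
```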

You could attach your confusion matrix here; it would be interesting to see.


Also, a neural network (even the one in sklearn) may show better results.
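A sketch of sklearn's `MLPClassifier`. Neural networks generally need scaled inputs, so scaling is bundled into a pipeline; `load_wine` again stands in for the question's `wine.csv`:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load_wine is a stand-in for the question's wine.csv.
X, y = load_wine(return_X_y=True)
train_feature, test_feature, train_class, test_class = train_test_split(
    X, y, stratify=y, train_size=0.75, test_size=0.25, random_state=0)

# Scaling is fit on the training fold only, then applied to the test fold.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
)
mlp.fit(train_feature, train_class)
mlp_accuracy = mlp.score(test_feature, test_class)
print("MLP test accuracy: {:.2f}".format(mlp_accuracy))
```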

Improve the preprocessing.


Methods such as decision trees and kNN can be sensitive to how the columns are preprocessed. For example, a decision tree can benefit greatly from well-chosen thresholds on continuous variables.
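To illustrate that sensitivity, the sketch below compares kNN with and without `StandardScaler`; `load_wine` stands in for the question's `wine.csv`:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load_wine is a stand-in for the question's wine.csv.
X, y = load_wine(return_X_y=True)
train_feature, test_feature, train_class, test_class = train_test_split(
    X, y, stratify=y, train_size=0.75, test_size=0.25, random_state=0)

# Without scaling, kNN distances are dominated by large-scale features.
raw_knn = KNeighborsClassifier(n_neighbors=5)
raw_knn.fit(train_feature, train_class)
raw_accuracy = raw_knn.score(test_feature, test_class)

# With scaling, every feature contributes comparably to the distance.
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(train_feature, train_class)
scaled_accuracy = scaled_knn.score(test_feature, test_class)

print("kNN raw:    {:.2f}".format(raw_accuracy))
print("kNN scaled: {:.2f}".format(scaled_accuracy))
```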

Show us a sample of the dataset. A wine.csv with "attributes such as player name and team name" would indeed be an interesting one.