Python: All my machine learning models are 100% accurate. What is wrong with my model?
Tags: python, database, scikit-learn
I am working on a dataset that is a collection of 5 hand-signed letters. I have uploaded the dataset to Kaggle in case anyone wants to take a look. So far I have trained and tested several models, but I keep getting 100% accuracy. Here is my code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
# importing all the necessary packages to use the various classification algorithms
from sklearn.linear_model import LogisticRegression # for the Logistic Regression algorithm
from sklearn.model_selection import train_test_split # to split the dataset for training and testing
from sklearn.neighbors import KNeighborsClassifier # for K nearest neighbours
from sklearn import svm # for the Support Vector Machine (SVM) algorithm
from sklearn import metrics # for checking the model accuracy
from sklearn.tree import DecisionTreeClassifier # for the Decision Tree algorithm
from mpl_toolkits.mplot3d import Axes3D
import os # accessing directory structure
from subprocess import check_output
# list the files available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
nRowsRead = None # None reads the whole file
df = pd.read_csv('/kaggle/input/ASL_DATA.csv', delimiter=',', nrows = nRowsRead)
df.dataframeName = 'ASL_DATA.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
# drop the unused columns; this has to run after df has been loaded
df = df.drop(['Id','Time', 'Wrist_Pitch','Wrist_Roll'],axis = 1)
plt.figure(figsize=(30,20))
# draws a heatmap of the correlation matrix; only numeric columns are used, since Letter is a label
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True,cmap='cubehelix_r')
plt.show()
train, test = train_test_split(df, test_size = 0.2) # the main data is split into train and test
# test_size=0.2 splits the data in an 80%/20% ratio: train=80% and test=20%
print(train.shape)
print(test.shape)
train_X = train[['Thumb_Pitch','Thumb_Roll','Index_Pitch','Index_Roll','Middle_Pitch','Middle_Roll','Ring_Pitch','Ring_Roll','Pinky_Pitch','Pinky_Roll']]# taking the training data features
train_y=train.Letter# output of our training data
test_X= test[['Thumb_Pitch','Thumb_Roll','Index_Pitch','Index_Roll','Middle_Pitch','Middle_Roll','Ring_Pitch','Ring_Roll','Pinky_Pitch','Pinky_Roll']] # taking test data features
test_y =test.Letter #output value of test data
from sklearn import preprocessing
mm_scaler = preprocessing.RobustScaler()
train_X = mm_scaler.fit_transform(train_X)
test_X = mm_scaler.transform(test_X)
model=DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction,test_y))
model=KNeighborsClassifier(n_neighbors=3) # this examines 3 neighbours for putting the new data into a class
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,test_y))
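There is nothing wrong with your model; the problem it has to solve is simply a small one. When you take all the features into account, these letters all look different. If you had chosen all the letters of the alphabet, or only letters that look alike, you would probably see some errors.
Rerun your model using only the index-finger features (Index_Pitch and Index_Roll). You will still get an AUC of roughly 95%. At least that way you can see that the only losses come from B, D and K. Looking at images of these signs, those three are the only ones that could plausibly be confused if you look at the index finger alone, and that turns out to be the case.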
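One way to try this suggestion is to train on the two index-finger columns only and look at the confusion matrix. The block below is a minimal sketch, not the answerer's code. `evaluate_feature_subset` and `make_toy_gestures` are hypothetical helpers, and the toy data is synthetic: the cluster centres, and the letters A and Y, are invented so that the block runs without the Kaggle file. With the real data you would pass the question's `df` instead of `toy`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_feature_subset(df, feature_cols, label_col="Letter", seed=0):
    """Train a decision tree on feature_cols only; return (accuracy, confusion matrix)."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[label_col],
        test_size=0.2, stratify=df[label_col], random_state=seed,
    )
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    pred = model.predict(X_test)
    labels = sorted(df[label_col].unique())
    # Rows are true letters, columns are predicted letters
    cm = pd.DataFrame(
        confusion_matrix(y_test, pred, labels=labels),
        index=labels, columns=labels,
    )
    return accuracy_score(y_test, pred), cm


def make_toy_gestures(n_per_letter=200, seed=0):
    """HYPOTHETICAL synthetic stand-in for ASL_DATA.csv (not the real data)."""
    rng = np.random.default_rng(seed)
    # Invented centres (Index_Pitch, Index_Roll, Middle_Pitch):
    # B, D and K share almost the same index-finger pose on purpose.
    centres = {
        "A": (80.0, 10.0, 80.0),
        "B": (5.0, 0.0, 5.0),
        "D": (7.0, 1.0, 80.0),
        "K": (6.0, 2.0, 40.0),
        "Y": (40.0, -20.0, 10.0),
    }
    frames = []
    for letter, (index_pitch, index_roll, middle_pitch) in centres.items():
        frames.append(pd.DataFrame({
            "Index_Pitch": rng.normal(index_pitch, 2.0, n_per_letter),
            "Index_Roll": rng.normal(index_roll, 2.0, n_per_letter),
            "Middle_Pitch": rng.normal(middle_pitch, 2.0, n_per_letter),
            "Letter": letter,
        }))
    return pd.concat(frames, ignore_index=True)


toy = make_toy_gestures()
acc_index, cm_index = evaluate_feature_subset(toy, ["Index_Pitch", "Index_Roll"])
acc_more, cm_more = evaluate_feature_subset(
    toy, ["Index_Pitch", "Index_Roll", "Middle_Pitch"]
)
print(f"index finger only: {acc_index:.3f}")
print(cm_index)
print(f"with Middle_Pitch: {acc_more:.3f}")
```

The off-diagonal cells of the confusion matrix show which letters get mixed up once the other fingers are hidden from the model. If the errors concentrate in a few visually similar signs, the perfect score on the full feature set is more likely a sign of an easy problem than of a bug.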
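The only "problem" is that your dataset is actually solvable.
How is the dataset divided for training and testing? How did you do the split?
Did you try applying the same transformation to train_X and test_X?
I used the train_test_split function with a test size of 0.2. I am a bit confused about which transformations to apply; I used RobustScaler so that outliers do not create noise.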
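On the transformation question: the code in the question already fits the scaler on train_X and reuses it on test_X, which is the usual pattern. A scikit-learn `Pipeline` does the same thing automatically, and combining it with cross-validation gives a second opinion on a single 80/20 split. The following is a minimal sketch under stated assumptions: the data comes from `make_blobs` as a synthetic stand-in for the ASL dataset, and the 5-fold setup is an illustrative choice rather than something taken from the thread.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# ASSUMPTION: synthetic stand-in with 5 classes and 10 features, not the real data
X, y = make_blobs(
    n_samples=500, centers=5, n_features=10, cluster_std=1.0, random_state=0
)

# Inside each fold the scaler is fitted on the training part only,
# then applied unchanged to the held-out part.
pipeline = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=3))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

If every fold scores close to 100% on the real data as well, that is consistent with the answer above: the classes are simply well separated in feature space.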