Python: All my machine learning models are 100% accurate. What is wrong with my models?

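I am working with a dataset that is a collection of 5 hand-made letters. I have uploaded the dataset to Kaggle in case anyone wants to take a look.

So far I have trained and tested several models, but I keep getting 100% accuracy.

Here is my code: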

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
# importing all the necessary packages to use the various classification algorithms
from sklearn.linear_model import LogisticRegression  # for Logistic Regression algorithm
from sklearn.model_selection import train_test_split #to split the dataset for training and testing
from sklearn.neighbors import KNeighborsClassifier  # for K nearest neighbours
from sklearn import svm  #for Support Vector Machine (SVM) Algorithm
from sklearn import metrics #for checking the model accuracy
from sklearn.tree import DecisionTreeClassifier #for using Decision Tree Algorithm
from mpl_toolkits.mplot3d import Axes3D
import os # accessing directory structure

from subprocess import check_output

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

nRowsRead = None  # None reads the whole file

df = pd.read_csv('/kaggle/input/ASL_DATA.csv', delimiter=',', nrows = nRowsRead)
# drop the columns that are not used as features (df must be loaded first)
df = df.drop(['Id','Time', 'Wrist_Pitch','Wrist_Roll'],axis = 1)

df.dataframeName = 'ASL_DATA.csv'
nRow, nCol = df.shape

print(f'There are {nRow} rows and {nCol} columns')

plt.figure(figsize=(30,20)) 
sns.heatmap(df.select_dtypes('number').corr(),annot=True,cmap='cubehelix_r') #draws a heatmap of the correlation matrix of the numeric columns
plt.show()

train, test = train_test_split(df, test_size = 0.2)# the data is split into a training set and a test set
# test_size=0.2 splits the data in an 80/20 ratio: train=80% and test=20%
print(train.shape)
print(test.shape)

train_X = train[['Thumb_Pitch','Thumb_Roll','Index_Pitch','Index_Roll','Middle_Pitch','Middle_Roll','Ring_Pitch','Ring_Roll','Pinky_Pitch','Pinky_Roll']]# taking the training data features
train_y=train.Letter# output of our training data
test_X= test[['Thumb_Pitch','Thumb_Roll','Index_Pitch','Index_Roll','Middle_Pitch','Middle_Roll','Ring_Pitch','Ring_Roll','Pinky_Pitch','Pinky_Roll']] # taking test data features
test_y =test.Letter   #output value of test data

from sklearn import preprocessing
mm_scaler = preprocessing.RobustScaler()
train_X = mm_scaler.fit_transform(train_X)
test_X = mm_scaler.transform(test_X)


model=DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction,test_y))


model=KNeighborsClassifier(n_neighbors=3) #this examines 3 neighbours for putting the new data into a class
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,test_y))


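There is nothing wrong with your models; this is simply a trivial problem for them to solve. When you take all of the features into account, these letters look nothing alike. If you had picked all of the letters, or only letters that look similar, you would probably see some errors.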

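Rerun the models using only Index_Pitch and Index_Roll. You will still get an AUC of about 95%. Doing so at least lets you see that the only loss comes from B, D, and K. Looking at images of those signs, they are the only three that could plausibly be confused with one another if you look at the index finger alone, and that turns out to be the case.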


It only looks like a problem because your dataset really is solvable.
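The feature-restriction check suggested above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: the helper `confusion_for_features` is hypothetical, and the data is a synthetic stand-in (not ASL_DATA.csv) in which B, D, and K are built to share the same index-finger readings, with two placeholder classes `P1` and `P2`. Only the column names `Index_Pitch`, `Index_Roll`, `Middle_Pitch`, and `Letter` come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def confusion_for_features(df, feature_cols, label_col="Letter", seed=0):
    """Train a decision tree on a subset of columns.

    Returns (test accuracy, confusion matrix as a labelled DataFrame).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[label_col],
        test_size=0.2, random_state=seed, stratify=df[label_col],
    )
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    pred = model.predict(X_test)
    labels = sorted(df[label_col].unique())
    cm = confusion_matrix(y_test, pred, labels=labels)
    # rows = true letter, columns = predicted letter
    return accuracy_score(y_test, pred), pd.DataFrame(cm, index=labels, columns=labels)


# Synthetic stand-in data (an assumption, NOT the real dataset):
# B, D and K share the same index-finger centre and differ only in Middle_Pitch.
rng = np.random.default_rng(0)
centres = {  # letter: (Index_Pitch, Index_Roll, Middle_Pitch)
    "B": (10, 0, 0), "D": (10, 0, 40), "K": (10, 0, 80),
    "P1": (50, 30, 0), "P2": (-40, -30, 0),
}
frames = []
for letter, centre in centres.items():
    values = rng.normal(loc=centre, scale=1.0, size=(200, 3))
    frame = pd.DataFrame(values, columns=["Index_Pitch", "Index_Roll", "Middle_Pitch"])
    frame["Letter"] = letter
    frames.append(frame)
demo = pd.concat(frames, ignore_index=True)

acc_all, cm_all = confusion_for_features(demo, ["Index_Pitch", "Index_Roll", "Middle_Pitch"])
acc_index, cm_index = confusion_for_features(demo, ["Index_Pitch", "Index_Roll"])
print("all features:", acc_all)
print("index finger only:", acc_index)
print(cm_index)  # off-diagonal counts show which letters get confused
```

On the real data, the same helper could be called with the question's `df` and the two index-finger columns; the off-diagonal cells of the returned matrix then show which letters are being mixed up.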

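What do the training and test sets look like? How did you split them? Did you apply the same transformation to train_X and test_X?

I used the train_test_split function with a test size of 0.2. I am a bit confused about which transformations to apply; I used RobustScaler so that outliers would not add noise.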
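One way to make sure the same transformation is applied to the training and test data, and that the score does not depend on a single lucky split, is to wrap the scaler and the model in a scikit-learn pipeline and cross-validate it. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` in place of the real dataset, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 10 finger features and 5 letter classes
# (an assumption for illustration, NOT the real ASL data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# The pipeline fits RobustScaler on the training folds only and reuses
# those fitted statistics to transform the held-out fold.
pipeline = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation: one accuracy value per fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

This mirrors what the question's code already does by hand (fit_transform on train_X, transform on test_X), but repeats it over several splits so a single split cannot hide a problem.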