Python: Why can't my two classifiers predict many labels?


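I am trying to compare the classifiers RandomForest (RF), SupportVectorMachine (SVM) and Multilayer Perceptron (MLP) by looking at their classification reports. It is a multiclass classification on categorical data.
All three classifiers use the same data (378443 entries, 7 columns), the same train/test split, and the same y_test.
I checked my y_train and y_test: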
    from collections import Counter
    Counter(y_train)
    Counter(y_test)

I see that both contain the same 31 classes:
OUTPUT Counter(y_train):             OUTPUT Counter(y_test)
Counter({'Class 1': 201096,        Counter({'Class 1': 133917,
         'Class 2': 24109,                  'Class 11': 5,
         'Class 3': 731,                    'Class 2': 16167,
         'Class 4': 851,                    'Class 3': 475,
         'Class 5': 60,                     'Class 4': 628,
         'Class 6': 7,                      'Class 8': 7,
         'Class 7': 19,                     'Class 12': 19,
         'Class 8': 3,                      'Class 21': 3,
         'Class 9': 12,                     'Class 25': 10,
         'Class 10': 7,                     'Class 18': 6,
         'Class 11': 5,                     'Class 9': 12,
         'Class 12': 28,                    'Class 5': 41,
         'Class 13': 5,                     'Class 16': 4,
         'Class 14': 8,                     'Class 7': 14,
         'Class 15': 9,                     'Class 17': 3,
         'Class 16': 6,                     'Class 30': 3,
         'Class 17': 7,                     'Class 26': 4,
         'Class 18': 4,                     'Class 27': 4,
         'Class 19': 6,                     'Class 14': 2,
         'Class 20': 5,                     'Class 28': 5,
         'Class 21': 7,                     'Class 13': 5,
         'Class 22': 6,                     'Class 24': 9,
         'Class 23': 7,                     'Class 15': 7,
         'Class 24': 15,                    'Class 31': 5,
         'Class 25': 10,                    'Class 10': 3,
         'Class 26': 10,                    'Class 23': 3,
         'Class 27': 6,                     'Class 29': 1,
         'Class 28': 5,                     'Class 22': 4,
         'Class 29': 9,                     'Class 20': 5,
         'Class 30': 7,                     'Class 6': 3,
         'Class 31': 5})                    'Class 19': 4})

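But I get this warning when printing classification_report(y_train, y_pred):
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
This means that not all labels appear in y_pred, i.e. there are some labels in y_test that the classifier never predicts.
With RF everything is fine (only one class gets 0.00 precision and recall in the classification report).
The classification reports of SVM and MLP contain 0.00s for half of the classes.
The MLP can predict 13 classes (precision/recall above 0.00): classes 1, 2, 3, 4, 6, 8, 9, 12, 14, 20, 21, 23, 25.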
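The claim that some labels never show up in the predictions can be checked directly by comparing label sets. The snippet below is only a minimal sketch on made-up toy labels (`y_true` and `y_pred` are hypothetical stand-ins for y_test and a classifier's output, not the real data):

```python
from collections import Counter

# Hypothetical toy labels standing in for y_test and a classifier's predictions.
y_true = ["Class 1", "Class 1", "Class 2", "Class 3", "Class 5"]
y_pred = ["Class 1", "Class 1", "Class 2", "Class 1", "Class 1"]

# Labels that occur in the ground truth but are never predicted are the ones
# whose precision is undefined and reported as 0.0.
never_predicted = sorted(set(y_true) - set(y_pred))
print(never_predicted)

# How often each label is actually predicted.
print(Counter(y_pred))
```

Running the same comparison on the real y_test and each classifier's predictions would list exactly the classes that show up with 0.00 in the report.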

All my code:
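# imports needed by the code below
import time
from collections import Counter

import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# data is imported

y = data['class']
data = data.drop(columns=['class'])
labEn = {}
# LabelEncoding for cols
for x in range(len(data.columns)):
    # creating the LabelEncoder for col x
    labEn[x] = LabelEncoder()
    data[data.columns[x]] = labEn[x].fit_transform(data[data.columns[x]])
    # for unknown labels
    labEn[x].classes_ = np.append(labEn[x].classes_, '-unknown-')

X = data
X.shape  # Output: (378443, 7)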
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
# amount of train and test data
X_train.shape, y_train.shape
# Output: (227065, 7) (227065,)
X_test.shape, y_test.shape
# Output: (151378, 7) (151378,)

from collections import Counter
print(Counter(y_train))
print(Counter(y_test))

##RF
rfclf = RandomForestClassifier(class_weight = 'balanced')
rfclf.fit(X_train,  y_train)

y_train_pred = cross_val_predict(rfclf, X_train, y_train, cv=3)
y_test_pred=cross_val_predict(rfclf, X_test, y_test, cv=3)

print(classification_report(y_train, y_train_pred))

print(classification_report(y_test, y_test_pred))

##for SVM and MLP: Scaling data
start_time_standardscaler = time.time()
scaler = StandardScaler()
scaler.fit(X_train) 

X_train_scaled=scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) 

#for svm: One Hot Encoder - I also tried it without!
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train_scaled)
X_train_ohencoded = enc.transform(X_train_scaled)
X_test_ohencoded=enc.transform(X_test_scaled)

##SVM
svmclf=svm.SVC(kernel='rbf', gamma='scale')
svmclf.fit(X_train_scaled,y_train)#also tried X_train_ohencoded,y_train)

y_train_pred_scaled = cross_val_predict(svmclf, X_train_scaled, y_train, cv=10)
#y_train_pred_ohencoded = cross_val_predict(svmclf, X_train_ohencoded, y_train, cv=10)
# test-set predictions from the SVM itself, not the RF's y_test_pred
y_test_pred_scaled = cross_val_predict(svmclf, X_test_scaled, y_test, cv=3)

print(classification_report(y_train, y_train_pred_scaled))
#print(classification_report(y_train, y_train_pred_ohencoded))

print(classification_report(y_test, y_test_pred_scaled))
#print(classification_report(y_test, y_test_pred_ohencoded))

##MLP
mlpclf = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(50, 100), random_state=1)
mlpclf.fit(X_train_scaled,y_train)

y_train_pred = cross_val_predict(mlpclf, X_train_scaled, y_train, cv=10)
y_test_pred = cross_val_predict(mlpclf, X_test_scaled, y_test, cv=3)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))



## Prediction
# works well, since only the 5 classes that all classifiers could learn have to be predicted here (how lucky)

# newdata is imported
# scaler from above is used
newdata_scaled = scaler.transform(newdata)
# encoder from above is used
newdata_enc = enc.transform(newdata_scaled)

rfclf.predict(newdata)
svmclf.predict(newdata_enc)  # valid only if svmclf was fit on X_train_ohencoded; otherwise pass newdata_scaled
mlpclf.predict(newdata_scaled)

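You have a severely imbalanced dataset: 99% of the samples fall into just 2 of the 31 classes. Besides the size of the dataset, the class distribution (the percentage of each class) also matters. Your model will tend to overfit to the high-percentage classes, because predicting them yields high accuracy.
One way to address this is to generate synthetic samples for the minority classes.
SMOTE (Synthetic Minority Over-sampling Technique) can be applied to your data through the imbalanced-learn (imblearn) Python package.
You can look there for more details.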
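To make the idea behind SMOTE concrete, here is a simplified NumPy-only sketch of its core step: interpolating between a minority sample and one of its nearest minority neighbours. It is an illustration under assumptions, not the imbalanced-learn implementation; the function name `smote_like_oversample` and the toy array `X_rare` are hypothetical. In practice the package's `SMOTE` class with `fit_resample` would likely be used instead. With classes as small as 3 samples, the number of neighbours probably has to be set below the smallest class size. Since the features here are label-encoded categories, plain interpolation may not be meaningful, and a categorical variant such as `SMOTENC` may fit better.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a sampled
    minority row and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least 2 samples to interpolate")
    k = min(k, n - 1)  # a class with n rows has at most n - 1 neighbours
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)            # rows to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical rare class with only 3 samples and 2 features.
X_rare = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_like_oversample(X_rare, n_new=5)
print(X_new.shape)
```

Oversampling of this kind should be applied to the training split only, so that the test set keeps the original class distribution.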
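Comments:
How big is your training dataset? What is the distribution of the 31 classes in the training set? Your model may fail because it has too little data to train on for some classes. Also, how many features do you have?
@cho_uc I updated my question. You can see the class distribution in the first table. Training data (X_train.shape): 227065, test data (X_test.shape): 151378, features: 7.
Thanks a lot! By the way, does that mean it will mostly predict the largest class in the training data because it overfits?
Yes, it will tend to predict the largest class. That is why the small classes score 0. You may also need to adjust further: reconsider whether your model really needs that many classes (e.g. whether 5 classes would already express what you want, instead of 31).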
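If reducing the number of classes is an option, one simple approach is to fold every class below a minimum count into a single catch-all label before training. This is only a sketch; the helper `merge_rare_classes`, the threshold, and the label "Other" are hypothetical choices, and whether such merging makes sense depends on what the classes mean.

```python
from collections import Counter

def merge_rare_classes(labels, min_count, other_label="Other"):
    """Replace every label occurring fewer than min_count times
    with a single catch-all label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]

# Hypothetical toy labels: two frequent classes and two very rare ones.
y = ["Class 1"] * 6 + ["Class 2"] * 4 + ["Class 8", "Class 29"]
y_merged = merge_rare_classes(y, min_count=3)
print(Counter(y_merged))
```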