Scikit learn 以下SVC的最佳类权重参数？_Scikit Learn_Svc

Scikit learn 以下SVC的最佳类权重参数？

scikit-learn

Scikit learn 以下SVC的最佳类权重参数？,scikit-learn,svc,Scikit Learn,Svc,您好，我正在与SKL合作学习执行分类器，我有以下标签分布： label : 0 frecuency : 119 label : 1 frecuency : 1615 label : 2 frecuency : 197 label : 3 frecuency : 70 label : 4 frecuency : 203 label : 5 frecuency : 137 label : 6 frecuency : 18 label : 7 frecuency : 142 label

您好，我正在与SKL合作学习执行分类器，我有以下标签分布：

label : 0 frecuency :  119
label : 1 frecuency :  1615
label : 2 frecuency :  197
label : 3 frecuency :  70
label : 4 frecuency :  203
label : 5 frecuency :  137
label : 6 frecuency :  18
label : 7 frecuency :  142
label : 8 frecuency :  15
label : 9 frecuency :  182
label : 10 frecuency :  986
label : 12 frecuency :  73
label : 13 frecuency :  27
label : 14 frecuency :  81
label : 15 frecuency :  168
label : 18 frecuency :  107
label : 21 frecuency :  125
label : 22 frecuency :  172
label : 23 frecuency :  3870
label : 25 frecuency :  2321
label : 26 frecuency :  25
label : 27 frecuency :  314
label : 28 frecuency :  76
label : 29 frecuency :  116

有一点很明显，我使用的是一个不平衡的数据集，我有很多25,23,1,10类的标签，我在培训后得到了以下坏结果：

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.61      0.23      0.34       528
          2       0.00      0.00      0.00        70
          3       0.67      0.06      0.11        32
          4       0.00      0.00      0.00        62
          5       0.78      0.82      0.80        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.14      0.01      0.02       313
         12       0.00      0.00      0.00        30
         13       0.31      0.57      0.40         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.41      0.74      0.53      1278
         25       0.28      0.39      0.33       758
         26       0.50      0.25      0.33         8
         27       0.29      0.02      0.03       115
         28       1.00      0.61      0.76        23
         29       0.00      0.00      0.00        42

avg / total       0.33      0.39      0.32      3683

weight={}
for i,v in enumerate(uniqLabels):
        weight[v]=labels_cluster.count(uniqLabels[i])/len(labels_cluster)

for i,v in weight.items():
        print(i,v)
print(weight)

from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)

我得到了很多零，SVC无法从几个类中学习，我使用的超参数如下：

from sklearn import svm
clf2= svm.SVC(kernel='linear')

为了克服这个问题，我为每个类建立了一个字典，每个类的权重如下：

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.61      0.23      0.34       528
          2       0.00      0.00      0.00        70
          3       0.67      0.06      0.11        32
          4       0.00      0.00      0.00        62
          5       0.78      0.82      0.80        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.14      0.01      0.02       313
         12       0.00      0.00      0.00        30
         13       0.31      0.57      0.40         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.41      0.74      0.53      1278
         25       0.28      0.39      0.33       758
         26       0.50      0.25      0.33         8
         27       0.29      0.02      0.03       115
         28       1.00      0.61      0.76        23
         29       0.00      0.00      0.00        42

avg / total       0.33      0.39      0.32      3683

weight={}
for i,v in enumerate(uniqLabels):
        weight[v]=labels_cluster.count(uniqLabels[i])/len(labels_cluster)

for i,v in weight.items():
        print(i,v)
print(weight)

from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)

这些是数字和输出，我只是将确定标签的元素数除以标签集中的元素总数，这些数字的总和是1：

0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346

使用此权重字典重试，如下所示：

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.61      0.23      0.34       528
          2       0.00      0.00      0.00        70
          3       0.67      0.06      0.11        32
          4       0.00      0.00      0.00        62
          5       0.78      0.82      0.80        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.14      0.01      0.02       313
         12       0.00      0.00      0.00        30
         13       0.31      0.57      0.40         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.41      0.74      0.53      1278
         25       0.28      0.39      0.33       758
         26       0.50      0.25      0.33         8
         27       0.29      0.02      0.03       115
         28       1.00      0.61      0.76        23
         29       0.00      0.00      0.00        42

avg / total       0.33      0.39      0.32      3683

weight={}
for i,v in enumerate(uniqLabels):
        weight[v]=labels_cluster.count(uniqLabels[i])/len(labels_cluster)

for i,v in weight.items():
        print(i,v)
print(weight)

from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)

我得到：

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.90      0.19      0.31       528
          2       0.00      0.00      0.00        70
          3       0.00      0.00      0.00        32
          4       0.00      0.00      0.00        62
          5       0.00      0.00      0.00        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.00      0.00      0.00       313
         12       0.00      0.00      0.00        30
         13       0.00      0.00      0.00         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.36      0.99      0.52      1278
         25       0.46      0.01      0.02       758
         26       0.00      0.00      0.00         8
         27       0.00      0.00      0.00       115
         28       0.00      0.00      0.00        23
         29       0.00      0.00      0.00        42

avg / total       0.35      0.37      0.23      3683

由于我没有得到好的结果，我非常欣赏自动调整每个类的权重并在SVC中表达出来的建议，我没有很多处理不平衡问题的经验，所以所有建议都很受欢迎。

看起来你做的与你应该做的相反。特别是，您希望在较小的类上放置更高的权重，以便分类器在这些类的训练期间受到更多的惩罚。一个好的起点是设置

class\u weight=“balanced”

是的，我同意你的看法，我的做法正好相反我想用这些数字反映我数据集的真实情况，现在我相信我需要用这些数字来构建一个新的规则，实现相反的效果，但考虑到这些数字——