sklearn.datasets.make_classification fails to generate balanced classes


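I am trying to use make_classification from the sklearn library to generate data for a classification task, and I want each class to have exactly 4 samples.

With up to 19 classes, it behaves as expected: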

from sklearn.datasets import make_blobs, make_classification
import numpy as np
data = make_classification(n_samples=76, n_features=5, n_informative=5, n_redundant=0, n_repeated=0, 
                           n_classes=19, n_clusters_per_class=1, weights=None, flip_y=0, class_sep=1.0, 
                           shuffle=False, random_state=101)
print(data[1])
[ 0  0  0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4  4  4  5  5  5  5
  6  6  6  6  7  7  7  7  8  8  8  8  9  9  9  9 10 10 10 10 11 11 11 11
 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 15 16 16 16 16 17 17 17 17
 18 18 18 18]

However, if the number of classes is 20 or more, the first class gets 5 samples and the last class gets only 3, which is unbalanced:

data = make_classification(n_samples=80, n_features=5, n_informative=5, n_redundant=0, n_repeated=0, 
                           n_classes=20, n_clusters_per_class=1, weights=None, flip_y=0, class_sep=1.0, 
                           shuffle=False, random_state=101)
print(data[1])
[ 0  0  0  0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4  4  4  5  5  5
  5  6  6  6  6  7  7  7  7  8  8  8  8  9  9  9  9 10 10 10 10 11 11 11
 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 15 16 16 16 16 17 17 17
 17 18 18 18 18 19 19 19]

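While checking the documentation, I found that the weights parameter controls the proportions of the classes:

weights : list of floats or None (default=None)
    The proportions of samples assigned to each class. If None, then
    classes are balanced. Note that if len(weights) == n_classes - 1, then
    the last class weight is automatically inferred. More than n_samples
    samples may be returned if the sum of weights exceeds 1.

So I tried to pass the proportions explicitly with the following code:
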
data = make_classification(n_samples=80, n_features=5, n_informative=5, n_redundant=0, n_repeated=0, 
                           n_classes=20, n_clusters_per_class=1, weights=list(np.ones(20)), flip_y=0, class_sep=1.0, 
                           shuffle=False, random_state=101)
print(data[1])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]

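However, the generated classes are completely wrong.
I don't know why the function behaves this way. How can I ensure balanced classes when n_classes is greater than or equal to 20?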
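
Although it is not stated explicitly and is easy to misread, the weights parameter expects the proportion of samples in each class. It does not automatically convert arbitrary numbers into proportions.
So if the total number of samples is 80 and you want 40 of them assigned to class 1, the proportion for that class is 0.5.

However, the proportions you supplied look like this:
[1.0, 1.0, 1.0, 1.0,.................., 1.0, 1.0, 1.0, 1.0]

That is the source of the error. The method takes the 1.0 as the proportion for the first class (0 in your case) and ignores all the other classes.
Do this instead:
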
n_classes = 20
weights = list(np.ones(n_classes) / n_classes)  # each class gets 1/20, so the proportions sum to 1

data = make_classification(n_samples=80, n_features=5, n_informative=5, n_redundant=0, n_repeated=0, 
                           n_classes=n_classes, n_clusters_per_class=1, weights=weights, flip_y=0, class_sep=1.0, 
                           shuffle=False, random_state=101)
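
The same idea carries over to unequal classes. As a minimal sketch, the helper below turns the number of samples wanted per class into proportions that sum to 1. The name proportions_from_counts is hypothetical and not part of scikit-learn.

```python
def proportions_from_counts(counts):
    """Convert desired per-class sample counts into proportions summing to 1.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


# 40 of 80 samples in each of two classes gives a proportion of 0.5 each.
two_class_weights = proportions_from_counts([40, 40])

# 4 samples in each of 20 classes gives a proportion of 0.05 each.
balanced_weights = proportions_from_counts([4] * 20)
```

The resulting list can be passed as the weights argument, with n_samples set to the total of the counts (80 in both cases here).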

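The last line of the documentation:
More than n_samples samples may be returned if the sum of weights exceeds 1.
seems to add to the confusion.
When you pass 1.0 as the proportion for every class, it should then return 80*20 = 1600 samples, 80 per class.
But that is not what happens. Internally it generates the samples correctly, but it returns only the first 80 of them (as defined by the n_samples parameter). That is why the generated data contains only one class (0). You should report this as an issue on the scikit-learn GitHub page.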
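
As for the 5/3 split you get with weights=None, one possible explanation is floating-point truncation. This is an assumption, not something confirmed from the scikit-learn source: suppose each class's count is computed as int(n_samples * weight), and any samples lost to truncation are handed to the first classes. A last weight that falls even slightly below 0.05 would then truncate to 3 samples, and the spare sample would go to class 0. The sketch below reproduces that pattern. The function allocate is hypothetical, and the weight 0.0499999 is a made-up stand-in for a rounding error.

```python
def allocate(n_samples, weights):
    """Split n_samples across classes by truncating each class's share.

    Assumed allocation rule for illustration only; it may not match
    what scikit-learn does internally.
    """
    # Truncate each share toward zero, as int() does.
    counts = [int(n_samples * w) for w in weights]
    # Hand any samples lost to truncation to the first classes, one each.
    leftover = n_samples - sum(counts)
    for i in range(leftover):
        counts[i % len(counts)] += 1
    return counts


# 19 exact weights plus a last weight slightly below 0.05
# (made-up value standing in for a rounding error).
weights = [0.05] * 19 + [0.0499999]
counts = allocate(80, weights)
print(counts[0], counts[1], counts[-1])  # 5 4 3
```

If this is indeed the cause, passing the weights explicitly avoids it, because every share is then exactly 80 * 0.05 = 4 and nothing is left over.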