Python "ValueError: The least populated class in y has only 1 member, which is too few" even though these classes have been removed


python machine-learning scikit-learn


I am running into a problem when using StratifiedShuffleSplit from sklearn on some multi-label data. The following self-contained example best illustrates the problem:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Generate some data
np.random.seed(0)
n_samples = 10
n_features = 40
n_labels = 20

x = np.random.rand(n_samples, n_features)
y = np.zeros((n_samples, n_labels))
for col in range(n_labels):
    n_instances = np.random.randint(5)
    indices = np.random.permutation(n_samples)[:n_instances]
    y[indices,col] = 1

print('Features from training set shape:', x.shape)
print('Labels from training set shape:', y.shape)
print('Are there any labels with fewer than two instances?', np.any(y.sum(axis=0) < 2), '\n')
print(y, '\n')

# Remove labels which are represented fewer than two times in the training set,
# since this messes with StratifiedShuffleSplit below.
label_indices_rm = np.where(y.sum(axis=0) < 2)[0]
y = np.delete(y, label_indices_rm, axis=1)

print(len(label_indices_rm), ' labels had fewer than two instances and were removed.')
print('Features from training set shape:', x.shape)
print('Labels from training set shape:', y.shape)
print('Are there any labels with fewer than two instances?', np.any(y.sum(axis=0) < 2), '\n')
print(y, '\n')

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5)
indices,_ = sss.split(x, y) # gives the training indices


This gives the following output:

Features from training set shape: (10, 40)
Labels from training set shape: (10, 20)
Are there any labels with fewer than two instances? True 

[[0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0.]] 

7  labels had fewer than two instances and were removed.
Features from training set shape: (10, 40)
Labels from training set shape: (10, 13)
Are there any labels with fewer than two instances? False 

[[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0.]
 [0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0.]] 

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-a490e96dd0e0> in <module>()
     32 
     33 sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5)
---> 34 indices,_ = sss.split(x, y) # gives the training indices

~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
   1202         """
   1203         X, y, groups = indexable(X, y, groups)
-> 1204         for train, test in self._iter_indices(X, y, groups):
   1205             yield train, test
   1206 

~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
   1544         class_counts = np.bincount(y_indices)
   1545         if np.min(class_counts) < 2:
-> 1546             raise ValueError("The least populated class in y has only 1"
   1547                              " member, which is too few. The minimum"
   1548                              " number of groups for any class cannot"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.


I have verified that no label is represented fewer than two times. Why do I still get this error?
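One possible explanation, offered as a sketch and not confirmed anywhere in this post: scikit-learn's stratified splitters appear to treat each distinct row of a 2-D `y` (that is, each label combination) as a single class, so the "at least 2 members" check applies to label combinations, not to individual label columns. The snippet below uses plain NumPy on a small made-up matrix `y_demo` (hypothetical data, not the matrix from the question) to show that every column can have two positives while every row-wise combination still occurs only once.

```python
import numpy as np

# Hypothetical label matrix: 4 samples, 3 labels.
# Every column sums to 2, yet no two rows are identical.
y_demo = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# Count per label column: this is what the question checked.
per_label_counts = y_demo.sum(axis=0)

# Count per distinct row (label combination): assumed to be what
# the stratified splitter actually counts for a 2-D y.
_, combo_counts = np.unique(y_demo, axis=0, return_counts=True)

print('Smallest per-label count:', per_label_counts.min())
print('Smallest per-combination count:', combo_counts.min())
```

If the same check on the real `y` reports a smallest per-combination count of 1, removing sparse label columns will not help, because the rows themselves remain unique.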

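Not sure that what you are doing is possible/correct.

@Stev Hmm, I need the stratification to avoid ending up with a train split that lacks at least one instance of some label. How can I ensure that with a normal ShuffleSplit?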
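A rough sketch of one way to approach that with a plain `ShuffleSplit`, assuming it is acceptable to redraw random splits until the training part contains every label at least once. The helper name `split_with_all_labels` and the matrix `y_demo` are made up for illustration; this is rejection sampling, not real stratification, and it says nothing about label coverage in the test part.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def split_with_all_labels(y, test_size=0.5, max_tries=1000, seed=0):
    """Hypothetical helper: draw random splits until every label
    column has at least one positive sample in the training part."""
    splitter = ShuffleSplit(n_splits=max_tries, test_size=test_size,
                            random_state=seed)
    # ShuffleSplit only needs the number of samples, so y can be passed as X.
    for train_idx, test_idx in splitter.split(y):
        if np.all(y[train_idx].sum(axis=0) > 0):
            return train_idx, test_idx
    raise RuntimeError('No split covering all labels found in %d tries'
                       % max_tries)

# Hypothetical label matrix: 4 samples, 3 labels.
y_demo = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

train_idx, test_idx = split_with_all_labels(y_demo)
print('Train indices:', train_idx)
print('Labels present in train:', y_demo[train_idx].sum(axis=0) > 0)
```

With many labels and few samples a covering split may be rare or impossible, in which case the helper raises instead of looping forever.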