Python: how to create a sample subset from the full MNIST data while keeping all 10 classes


python numpy machine-learning


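Suppose X, Y = load_mnist(), where X and Y are tensors containing the whole of MNIST. I now want a smaller fraction of the data so that my code runs faster, but I need all 10 classes to remain in it, in a balanced way. Is there a simple way to do this?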

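scikit-learn's train_test_split is meant to split data into train and test sets, but with the stratify parameter you can also use it to create a subset of a dataset that keeps the class proportions of the original. Just specify the train/test size ratio you want and you get a smaller, stratified sample of the data. In your case: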

from sklearn.model_selection import train_test_split

X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
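To get a subset much smaller than half of the data, one option is to pass train_size and simply discard the other part of the split. The sketch below uses a synthetic stand-in for MNIST (random features, 100 samples per class), so the array sizes and the names X_small and Y_small are assumptions for illustration only; a float such as train_size=0.1 should work the same way as the integer used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 1000 samples, 784 features, 100 samples per class.
X = rng.random((1000, 784))
Y = np.repeat(np.arange(10), 100)

# Keep 100 samples in total, stratified on the labels; the remainder is discarded.
X_small, _, Y_small, _ = train_test_split(
    X, Y, train_size=100, stratify=Y, random_state=0
)

# Count how many samples of each class ended up in the subset.
class_counts = np.bincount(Y_small, minlength=10)
print(X_small.shape, class_counts)
```

Because stratification preserves the class proportions of Y, the subset is only as balanced as the original labels are.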

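If you want more control over this, you can use numpy.random.choice to generate as many indices as the subset size and use them to sample the original arrays, as in the following code: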

# input data; assume that you have 10K samples
In [76]: import numpy as np
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)

# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500

# draw `subset_size` indices uniformly at random; replace=False avoids picking the same sample twice
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)

# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]

In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
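Uniform index sampling like this only matches the class distribution on average. If every class must contribute exactly the same number of samples, one possible variant is to draw the indices class by class. This is a minimal sketch, not part of the answer above: the helper name balanced_subset and its per_class and seed parameters are hypothetical, and the data is again synthetic.

```python
import numpy as np

def balanced_subset(X, Y, per_class, seed=0):
    """Return a shuffled subset with exactly `per_class` samples of every label."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(Y):
        # Positions of all samples that carry this label.
        label_idx = np.flatnonzero(Y == label)
        # Raises ValueError if the class has fewer than `per_class` samples.
        picked.append(rng.choice(label_idx, size=per_class, replace=False))
    # Shuffle so the subset is not ordered by class.
    subset_idx = rng.permutation(np.concatenate(picked))
    return X[subset_idx], Y[subset_idx]

# Synthetic stand-in for MNIST: 10K samples with random labels 0..9.
data_rng = np.random.default_rng(1)
X = data_rng.random((10000, 784))
Y = data_rng.integers(0, 10, size=10000)

X_subset, Y_subset = balanced_subset(X, Y, per_class=50)
print(X_subset.shape, np.bincount(Y_subset, minlength=10))
```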

Stratification will ensure that the class proportions are preserved.

If you want to run K-fold-style evaluation over several stratified splits, then:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

# X and Y are NumPy arrays here; use X.iloc[...] / Y.iloc[...] instead if they are pandas objects
for train_index, test_index in sss.split(X, Y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
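Note that StratifiedShuffleSplit draws each split independently, so the test sets of different splits can overlap. For K-fold in the strict sense, where every sample lands in exactly one test fold, scikit-learn also provides StratifiedKFold. A small sketch on synthetic data (the array sizes and the name fold_class_counts are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1000 samples, 100 per class.
X = rng.random((1000, 20))
Y = np.repeat(np.arange(10), 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Record the per-class counts of each test fold to check the stratification.
fold_class_counts = []
for train_index, test_index in skf.split(X, Y):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    fold_class_counts.append(np.bincount(Y_test, minlength=10))

fold_class_counts = np.array(fold_class_counts)
print(fold_class_counts)
```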

See the scikit-learn documentation.

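But doesn't this fail to ensure a balanced class distribution?

This is uniform sampling over the indices. So, in theory (that is, in expectation), the subset follows the same class distribution as the full data.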
