How to split data into 3 sets (train, validation, and test)?

pandas numpy dataframe machine-learning scikit-learn

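I have a pandas DataFrame and I would like to split it into 3 separate sets. I know that with train_test_split from sklearn.cross_validation (since moved to sklearn.model_selection), one can split the data into two sets (train and test). However, I couldn't find any solution for splitting the data into three sets. Preferably, I'd like to keep the indices of the original data.

I know that a workaround would be to use train_test_split twice and somehow adjust the indices. But is there a more standard / built-in way to split the data into 3 sets instead of 2?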

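Note: the function below was written to handle seeding of the randomized set creation. You should not rely on a set split that does not randomize the sets.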

import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    # Seed the generator so that the split is reproducible.
    np.random.seed(seed)
    m = len(df.index)
    # Permute row positions (not index labels) so that .iloc works for any index.
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    # Whatever is left after train and validate becomes the test set.
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

Demonstration
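A quick check of the function above, as a minimal sketch. The 10-row frame, its string index, and the seed are made up for illustration, and the function is repeated (permuting row positions) only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    m = len(df.index)
    perm = np.random.permutation(m)  # shuffled row positions
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test


# Illustrative data: 10 rows with a non-numeric index.
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"),
                  index=list("abcdefghij"))

train, validate, test = train_validate_test_split(df, seed=0)
print(len(train), len(validate), len(test))  # 6 2 2

# Each subset keeps the original index labels of its rows.
print(sorted(train.index), sorted(validate.index), sorted(test.index))
```

Because the subsets are taken with .iloc on shuffled positions, the original index labels survive the split, which is what the question asked for.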

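NumPy solution. We first shuffle the whole dataset (df.sample(frac=1, random_state=42)) and then split it into the following parts:

60% train set
20% validation set
20% test set

[int(.6*len(df)), int(.8*len(df))] is the indices_or_sections array passed to numpy.split().

Here is a small demo of np.split() usage. Let's split a 20-element array into the following parts: 80%, 10%, 10%: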
In [45]: a = np.arange(1, 21)

In [46]: a
Out[46]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
Out[47]:
[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
 array([17, 18]),
 array([19, 20])]
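The same cut points can also be applied to a shuffled DataFrame with plain positional slicing. This is only a sketch of an alternative to calling np.split() on the DataFrame directly. The 20-row frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; any DataFrame works the same way.
df = pd.DataFrame(np.arange(40).reshape(20, 2), columns=["x", "y"])

shuffled = df.sample(frac=1, random_state=42)  # shuffle all rows
n = len(shuffled)
train_end, validate_end = int(.6 * n), int(.8 * n)

train = shuffled.iloc[:train_end]                  # first 60%
validate = shuffled.iloc[train_end:validate_end]   # next 20%
test = shuffled.iloc[validate_end:]                # last 20%

print(len(train), len(validate), len(test))  # 12 4 4
```

Slicing with .iloc gives the same three pieces as the two cut indices passed to np.split(), without relying on how NumPy functions are dispatched on a DataFrame.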

However, one way to divide the dataset into train, test, and cv sets with fractions 0.6, 0.2, and 0.2 is to use the train_test_split method twice:
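from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set.
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
# Second split: 25% of the remaining 80% is 20% of the original data.
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)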
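To see why 0.25 is the right second test_size, here is a small self-contained check. The arrays are synthetic stand-ins for xtrain and labels, and random_state=0 is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `xtrain` and `labels`: 100 samples, 3 features.
xtrain = np.arange(300).reshape(100, 3)
labels = np.arange(100) % 2

# First split: 80% kept for training/CV, 20% held out as the test set.
x, x_test, y, y_test = train_test_split(
    xtrain, labels, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% is 20% of the original data.
x_train, x_cv, y_train, y_cv = train_test_split(
    x, y, test_size=0.25, random_state=0)

print(len(x_train), len(x_cv), len(x_test))  # 60 20 20
```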

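Using train_test_split is very convenient: there is no need to re-index after dividing into several sets, and no extra code to write. The best answer above does not mention that splitting twice with train_test_split without adjusting the partition sizes will not give the initially intended partition: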
x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The validation and test fractions within x_remain therefore change, and can be computed as:
new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

In this case all of the initial partition sizes are preserved.
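The rescaling described above can be wrapped in a small helper. This is a minimal sketch: the name three_way_split and its defaults are hypothetical, and it skips the np.around() rounding step on the assumption that the exact ratio is what is wanted.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(x, val_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper: split x into train/val/test with two calls."""
    # First call: set aside everything that is not training data.
    x_train, x_remain = train_test_split(
        x, test_size=val_size + test_size, random_state=random_state)
    # Second call: test_size must be relative to x_remain, not to x.
    relative_test_size = test_size / (val_size + test_size)
    x_val, x_test = train_test_split(
        x_remain, test_size=relative_test_size, random_state=random_state)
    return x_train, x_val, x_test


x = np.arange(100)
x_train, x_val, x_test = three_way_split(x, val_size=0.2, test_size=0.2,
                                         random_state=0)
print(len(x_train), len(x_val), len(x_test))  # 60 20 20
```

Passing test_size=0.2 unchanged to the second call would instead take 20% of the 40% remainder, leaving a test set of only 8% of the data.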
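Here is a Python function that splits a pandas DataFrame into train, validation, and test DataFrames with stratified sampling. It performs this split by calling scikit-learn's train_test_split() function twice.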
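import math

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a pandas DataFrame into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : pandas DataFrame
        Input DataFrame to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the DataFrame will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomState instance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        DataFrames containing the three splits.
    '''

    # Compare with a tolerance: sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0.
    if not math.isclose(frac_train + frac_val + frac_test, 1.0):
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input  # Contains all columns.
    y = df_input[[stratify_colname]]  # DataFrame of just the column on which to stratify.

    # Split the original DataFrame into train and temp DataFrames.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp DataFrame into val and test DataFrames.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test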

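Below is a complete working example.

Consider a dataset that has a label on which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar, and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels. See the illustration below:

[illustration: a 60/20/20 split in which each subset keeps the 75/15/10 label distribution]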

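Here is the sample dataset:

df = pd.DataFrame({'A': list(range(0, 100)),
                   'B': list(range(100, 0, -1)),
                   'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10})

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10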
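As a sketch of the stratified 60/20/20 split on this sample dataset: the snippet below uses two direct train_test_split() calls so that it is self-contained, and random_state=0 is an arbitrary choice for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'A': list(range(0, 100)),
                   'B': list(range(100, 0, -1)),
                   'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10})

# Step 1: 60% train, 40% temporary, stratified on the label column.
df_train, df_temp = train_test_split(
    df, test_size=0.4, stratify=df['label'], random_state=0)

# Step 2: split the temporary 40% in half (20% validation, 20% test).
df_val, df_test = train_test_split(
    df_temp, test_size=0.5, stratify=df_temp['label'], random_state=0)

# Report the size and label counts of each subset.
for name, part in [('train', df_train), ('val', df_val), ('test', df_test)]:
    print(name, len(part), part['label'].value_counts().to_dict())
```

With these sizes the 75/15/10 proportions divide evenly, so each subset should show the same label mix as the full dataset.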
The shuffle-and-split from the NumPy solution, applied to a 10-row DataFrame:

In [305]: train, validate, test = \
              np.split(df.sample(frac=1, random_state=42), 
                       [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919  0.792216  0.206294  0.440346  0.038960
2  0.301010  0.625697  0.604724  0.936968  0.870064
1  0.642237  0.690403  0.813658  0.525379  0.396053
9  0.488484  0.389640  0.599637  0.122919  0.106505
8  0.842717  0.793315  0.554084  0.100361  0.367465
7  0.185214  0.603661  0.217677  0.281780  0.938540

In [307]: validate
Out[307]:
          A         B         C         D         E
5  0.806176  0.008896  0.362878  0.058903  0.026328
6  0.145777  0.485765  0.589272  0.806329  0.703479

In [308]: test
Out[308]:
          A         B         C         D         E
4  0.521640  0.332210  0.370177  0.859169  0.401087
3  0.333348  0.964011  0.083498  0.670386  0.169619

import numpy as np

# 1st case: df contains X and y (where y is the "target" column of df)
df_shuffled = df.sample(frac=1)
X_shuffled = df_shuffled.drop("target", axis=1)
y_shuffled = df_shuffled["target"]

# 2nd case (alternative to the 1st): X and y are two separate objects
# that share the same index
X_shuffled = X.sample(frac=1)
y_shuffled = y.loc[X_shuffled.index]  # .loc picks rows by label, in X's shuffled order

# We do the split as in the chosen answer
n = len(X_shuffled)
X_train, X_validation, X_test = np.split(X_shuffled, [int(0.6*n), int(0.8*n)])
y_train, y_validation, y_test = np.split(y_shuffled, [int(0.6*n), int(0.8*n)])

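from sklearn.model_selection import train_test_split

# First split: hold out 10% of the rows as the test set.
# Index labels are split instead of rows, so the original index is kept.
my_test_size = 0.10

X_train_, X_test, y_train_, y_test = train_test_split(
    df.index.values,
    df.label.values,
    test_size=my_test_size,
    random_state=42,
    stratify=df.label.values,
)

# Second split: 20% of the remaining 90% (18% of all rows) for validation.
my_val_size = 0.20

X_train, X_val, y_train, y_val = train_test_split(
    df.loc[X_train_].index.values,
    df.loc[X_train_].label.values,
    test_size=my_val_size,
    random_state=42,
    stratify=df.loc[X_train_].label.values,
)

# data_type is not necessary; it only records which subset each row belongs to.
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.loc[X_test, 'data_type'] = 'test'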
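Once the data_type column is filled in, each subset can be read back with a boolean mask. The sketch below repeats the two stratified splits on a made-up 100-row frame so that it runs on its own. The frame, the feature column, and the idx_* variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data with a label column to stratify on.
df = pd.DataFrame({'feature': range(100),
                   'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10})

# Split index labels rather than rows, stratifying on the label column.
idx_rest, idx_test = train_test_split(
    df.index.values, test_size=0.10, random_state=42,
    stratify=df.label.values)
idx_train, idx_val = train_test_split(
    idx_rest, test_size=0.20, random_state=42,
    stratify=df.loc[idx_rest, 'label'].values)

# Mark every row with the subset it belongs to.
df['data_type'] = 'not_set'
df.loc[idx_train, 'data_type'] = 'train'
df.loc[idx_val, 'data_type'] = 'val'
df.loc[idx_test, 'data_type'] = 'test'

# Recover each subset from the marker column.
train_df = df[df.data_type == 'train']
val_df = df[df.data_type == 'val']
test_df = df[df.data_type == 'test']
print(len(train_df), len(val_df), len(test_df))  # 72 18 10
```

Note that the resulting proportions are 72/18/10, because the 20% validation share is taken from the 90% left after the test split.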

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from tensorflow.keras import Model

model = Model(input_layer, out)

[...]

# Keras holds out the last 30% of the training data for validation.
history = model.fit(x=X_train, y=y_train, [...], validation_split=0.3)
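In this last approach the validation set is carved out of the training data by Keras itself. The sketch below only imitates that slicing with NumPy to make the resulting proportions explicit. It assumes, as the Keras documentation describes, that validation_split takes the last fraction of the samples passed to fit(), before any shuffling. The arrays are synthetic and the names X_fit and split_at are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 500 samples, 2 features.
X = np.arange(1000).reshape(500, 2)
y = np.arange(500) % 2

# 20% of the data is held out as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Imitate validation_split=0.3: hold out the last 30% of the training rows.
split_at = int(len(X_train) * (1 - 0.3))
X_fit, X_val = X_train[:split_at], X_train[split_at:]

# Fractions of the full dataset that end up in each subset.
n = len(X)
print(len(X_fit) / n, len(X_val) / n, len(X_test) / n)  # 0.56 0.24 0.2
```

So test_size=0.2 followed by validation_split=0.3 yields roughly a 56/24/20 split of the full dataset, not 50/30/20.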