Python: How do I create test and train samples from one DataFrame with pandas?

Tags: python, python-2.7, pandas, dataframe

I have a fairly large dataset in the form of a DataFrame and I was wondering how I would be able to split it into two random samples (80% and 20%) for training and testing.


Thanks.

I would just use numpy's rand to draw a random mask (randn below only creates the example data):

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]
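Just to check that this worked: the sizes should be roughly 80/20. Since the mask is random, the exact counts vary from run to run (this check is an addition, not part of the original answer):

In [15]: len(train), len(test)   # roughly (80, 20); exact counts vary per run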

scikit-learn's train_test_split is a good one: it handles numpy arrays as well as DataFrames:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
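For reproducible splits, train_test_split also accepts a random_state seed (this variant is a suggestion, not part of the original answer):

from sklearn.model_selection import train_test_split

# fixed seed so the same rows land in train/test on every run
train, test = train_test_split(df, test_size=0.2, random_state=42)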

You may also consider a stratified division into training and testing sets. A stratified split also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
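For example (assuming the labels live in a column named 'class'; the column name is illustrative):

train_inds, test_inds = get_train_test_inds(df['class'], train_proportion=0.8)
train, test = df[train_inds], df[test_inds]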


Here is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the datasets exactly (i.e., it would sometimes be 79, sometimes 81, etc.).
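A minimal sketch of one way to get exact split sizes, using a shuffled positional index (the function and its name are illustrative, not the answerer's original code):

import numpy as np

def split_exact(df, train_size=0.8):
    # shuffle all row positions, then cut at an exact boundary
    shuffled = np.random.permutation(len(df))
    n_train = int(train_size * len(df))  # exactly 80 rows out of 100, every time
    return df.iloc[shuffled[:n_train]], df.iloc[shuffled[n_train:]]

train, test = split_exact(df, 0.8)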


I would use scikit-learn's own train_test_split, and generate it from the index:

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
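And symmetrically for the test rows. Note that iloc with these values assumes the default integer RangeIndex (positions equal labels); for an arbitrary index, use loc instead:

X.iloc[X_test] # return dataframe test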


If you wish to have one DataFrame in and two DataFrames out (not numpy arrays), this should do the trick:

def split_data(df, train_perc=0.8):
    df['train'] = np.random.rand(len(df)) < train_perc

    train = df[df.train == 1]
    test = df[df.train == 0]

    split_data = {'train': train, 'test': test}

    return split_data
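Note that this mutates df by adding a 'train' column as a side effect. Usage might look like this (my example, not the answerer's):

split = split_data(df, train_perc=0.8)
train, test = split['train'], split['test']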
I think you also need to get a copy, not a slice, of the DataFrame if you want to add columns later:

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)


You can make use of the df.as_matrix() function to create a numpy array and pass it:

from sklearn.model_selection import train_test_split

Y = df.pop('target')  # pop() needs a column name; 'target' is a placeholder
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model.fit(x_train, y_train)  # assumes `model` is an estimator you created earlier
model.score(x_test, y_test)  # sklearn estimators expose score(), not test()
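On current pandas, as_matrix() has been removed; to_numpy() is the replacement (a small modern variant of the same idea, with 'target' still a placeholder column name):

Y = df.pop('target').to_numpy()
X = df.to_numpy()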


Pandas' random sample will also work:

train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
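A quick sanity check that the two sets are disjoint and cover everything (my addition, not part of the original answer):

assert len(train) + len(test) == len(df)
assert train.index.intersection(test.index).empty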

How about this? df is my DataFrame:

import math

total_size = len(df)

train_size = math.floor(0.66 * total_size)  # 2/3 of my dataset

# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)

There are many valid answers here. Adding one more to the bunch:

from sklearn.cross_validation import train_test_split

(Note that sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18.)

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]


Just select a range of rows from df like this:

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
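Like the head/tail answer above, this split is positional, not random; if the rows have any ordering, a common fix (my suggestion, not the answerer's) is to shuffle first:

df_shuffled = df.sample(frac=1, random_state=0)  # shuffle all rows
row_count = df_shuffled.shape[0]
split_point = int(row_count * 1 / 5)
test_data, train_data = df_shuffled[:split_point], df_shuffled[split_point:]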


If you need to split your data with respect to a label column in your dataset, you can use this:

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_df = pd.concat([train_df, lbl_train_df])  # DataFrame.append was removed in pandas 2.0
        test_df = pd.concat([test_df, lbl_test_df])

    return train_df, test_df
And use it:

train, test = split_to_train_test(data, 'class', 0.7)

You can also pass a random_state if you want to control the split randomness or use some global random seed.
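scikit-learn can do the same stratified split in one call via the stratify parameter (an alternative to the helper above, not part of the original answer; 'class' is an assumed label column name):

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size=0.7, stratify=df['class'])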


To split into more than two classes such as train, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs >= 0.7) & (probs < 0.85)
validation_mask = probs >= 0.85

df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
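For exact rather than approximate proportions, the same three-way split can be done with two chained train_test_split calls (my sketch, not part of this answer):

from sklearn.model_selection import train_test_split

# first carve off 30%, then cut that 30% in half: 70/15/15
df_training, df_rest = train_test_split(df, test_size=0.30, random_state=0)
df_test, df_validation = train_test_split(df_rest, test_size=0.50, random_state=0)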


You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

The test size can vary depending on the percentage of data you want to put in your test and train datasets.

import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

target_attribute = data['column_name']
data = data.drop(columns='column_name')  # drop the target so it does not leak into the features

X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)  # note: test_size=0.8 puts 80% of the rows in the test set


A bit more elegant to my taste is to create a random column and then split by it; this way we can get a split that suits our needs and will be random.

def split_df(df, p=[0.8, 0.2]):
    import numpy as np
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    r = [df[df["rand"] == val] for val in df["rand"].unique()]
    return r
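Usage might look like this (my example). One caveat: the returned list is ordered by first appearance of the random label, not by label value, so it is safer to select pieces via the helper column:

split_df(df, p=[0.8, 0.2])  # adds the "rand" column to df in place
train = df[df["rand"] == 0].drop(columns="rand")  # label 0 is drawn with probability 0.8
test = df[df["rand"] == 1].drop(columns="rand")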


There are many ways to create train/test and even validation samples.

Case 1: the classic way, train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: ve…