Python: Writing a train/test split function with numpy

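I am trying to write my own train/test split function using numpy instead of using sklearn's train_test_split. I am splitting the data into 70% training and 30% testing, and I am using the Boston housing dataset from sklearn.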

This is the shape of the data:

housing_features.shape #(506,13) where 506 is sample size and it has 13 features.
This is my code:

import numpy as np
from sklearn import datasets

city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    # Each row independently lands in the training set with probability 0.7
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test =  X[~split]
    y_Test = y[~split]

    print(len(X_Train), len(y_Train), len(X_Test), len(y_Test))
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print("Successful")
except Exception:
    print("Fail")
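But I know this is not working correctly, because every time I run it again I get different lengths, whereas sklearn's train_test_split always gives me 354 for the length of X_train.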

#correct output
from sklearn.cross_validation import train_test_split  # lives in sklearn.model_selection from scikit-learn 0.18 on
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print(len(X_train))
#354

What is my function missing?

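Because you are using np.random.rand, which gives you random numbers, the fraction of values below 0.7 only approaches 70% for very large arrays. You can use np.percentile to get the value at the 70th percentile and then compare against that value, as you did: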

def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    # Threshold at the 70th percentile of this draw, so 70% of the rows
    # (up to rounding) fall below it
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test =  X[~split]
    y_test = y[~split]

    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test
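
To see the difference between the two thresholds, here is a small sketch that counts how many rows each approach sends to the training set over repeated draws. It uses only a row count of 506 rather than the real dataset, and the seed and the number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

n_samples = 506  # same number of rows as the dataset in the question
rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration

# Fixed 0.7 threshold: the training count is a binomial draw, so it varies
fixed_counts = [int((rng.random(n_samples) < 0.7).sum()) for _ in range(20)]

# Percentile threshold: the cut adapts to each draw, so the count is stable
percentile_counts = []
for _ in range(20):
    arr_rand = rng.random(n_samples)
    percentile_counts.append(int((arr_rand < np.percentile(arr_rand, 70)).sum()))

print("fixed threshold:     ", sorted(set(fixed_counts)))
print("percentile threshold:", sorted(set(percentile_counts)))
```

With the fixed threshold the count scatters around 354, while the percentile threshold gives 354 on every draw for 506 rows.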

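On another note, should I be using random at all? Shouldn't X_train correspond to the y_train values? Or is the structure still preserved even when using random?

@jxn You should use random, because the original train_test_split has random_state, which implies random output. And of course X_train corresponds to y_train, because you use the same mask for both.

@jxn Or you can use np.random.choice: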
np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)  # replace=False so no row index is drawn twice
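
The np.random.choice call only produces training indices, so the test set still has to be built from the remaining rows. One possible way to complete the idea is sketched below. The helper name choice_split_data, its train_fraction and seed parameters, and the synthetic stand-in data are all assumptions for illustration and are not part of the original answer. Note that it returns values in the order used by sklearn's train_test_split, which differs from the order in the function above.

```python
import numpy as np

def choice_split_data(X, y, train_fraction=0.7, seed=None):
    """Hypothetical helper: split X and y by sampling training row indices."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    n_train = int(train_fraction * n_samples)

    # replace=False guarantees that every training index is distinct
    train_idx = rng.choice(n_samples, size=n_train, replace=False)

    # Every row that was not drawn for training goes to the test set
    is_train = np.zeros(n_samples, dtype=bool)
    is_train[train_idx] = True

    # The same mask is applied to X and y, so rows and targets stay aligned
    return X[is_train], X[~is_train], y[is_train], y[~is_train]

# Synthetic stand-in data with the same shape as in the question
X = np.arange(506 * 13, dtype=float).reshape(506, 13)
y = X[:, 0].copy()  # each target equals the first feature of its row

X_train, X_test, y_train, y_test = choice_split_data(X, y, seed=42)
print(len(X_train), len(X_test))
```

Because the rows are selected through a boolean mask, each subset keeps the original row order. If shuffled output is wanted, indexing with a permutation of the indices would be one alternative.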