Python: Writing a train/test split function with numpy


python numpy scikit-learn


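I am trying to write my own train/test split function using numpy instead of using sklearn's train_test_split. I split the data into 70% training and 30% testing. I am using the Boston housing dataset from sklearn.

This is the shape of the data: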

housing_features.shape #(506,13) where 506 is sample size and it has 13 features.

This is my code:

import numpy as np
from sklearn import datasets

# Load the Boston housing data: 506 samples, 13 features
city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    # Boolean mask: each row is independently True with probability 0.7
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test = X[~split]
    y_Test = y[~split]

    print(len(X_Train), len(y_Train), len(X_Test), len(y_Test))
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print("Successful")
except Exception:
    print("Fail")

But I know this is not working, because every time I run it again I get different lengths, whereas with sklearn's train_test_split the length of X_train is always 354.

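# correct output
# (sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print(len(X_train))
# 354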

What am I missing?

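Because you are using np.random.rand, which gives you random numbers, the share of values below the 0.7 cutoff only approaches 70% for very large samples. You can use np.percentile to get the 70th-percentile value and then compare against it, just as you did: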

def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    # Threshold at the 70th percentile of the draws, so 70% of rows fall below it
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]

    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test
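
To illustrate why the fixed 0.7 cutoff gives a different length on each run while the percentile cutoff does not, here is a minimal sketch. It uses only random draws with the same row count as the housing data (506); the seed and the number of repetitions are arbitrary choices for illustration, not part of the original question.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, only to make the sketch repeatable
n_rows = 506  # same number of rows as the housing data in the question

fixed_counts = []
percentile_counts = []
for _ in range(5):
    arr_rand = rng.rand(n_rows)
    # Fixed cutoff: each row is kept independently with probability 0.7,
    # so the number of kept rows fluctuates around 0.7 * n_rows.
    fixed_counts.append(int((arr_rand < 0.7).sum()))
    # Percentile cutoff: the threshold adapts to the draws themselves,
    # so the number of rows below it stays the same from run to run.
    cutoff = np.percentile(arr_rand, 70)
    percentile_counts.append(int((arr_rand < cutoff).sum()))

print(fixed_counts)       # counts that typically differ between repetitions
print(percentile_counts)  # the same count in every repetition
```

With 506 rows the percentile version should select 354 training rows each time, which is the length sklearn reports.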

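On another note, should I be using random at all? Shouldn't X_train correspond to the y_train values? Or is that structure still preserved even when using random?

@jxn You should use random, because the original train_test_split has random_state, which means its output is random. And of course X_train corresponds to y_train, because you use the same mask for both.

@jxn Or you can use np.random.choice: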

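# replace=False so that no row index is drawn twice (the default samples with replacement)
np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)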
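
The index-based idea can be turned into a complete split function. The sketch below is one possible way to do it, not the thread's own solution: it swaps in a shuffled permutation of the row indices (RandomState.permutation) rather than np.random.choice, so that the test indices come for free as the remainder. The train_fraction and seed parameters are hypothetical names introduced here, and the data is a synthetic stand-in with the same shape as the housing features. The seed makes the split repeatable in the way random_state does, but it is not expected to pick the same rows as train_test_split.

```python
import numpy as np


def shuffle_split_data(X, y, train_fraction=0.7, seed=None):
    """Split X and y into train/test parts using one shuffled index array."""
    rng = np.random.RandomState(seed)  # seed plays a role similar to random_state
    n_rows = X.shape[0]
    n_train = int(train_fraction * n_rows)
    indices = rng.permutation(n_rows)  # every row index exactly once, shuffled
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # The same index arrays are applied to X and y, so rows stay paired.
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Synthetic stand-in for the housing data (506 rows, 13 features).
X = np.arange(506 * 13, dtype=float).reshape(506, 13)
y = X[:, 0].copy()  # target tied to the first feature so pairing can be checked

X_train, y_train, X_test, y_test = shuffle_split_data(X, y, seed=42)
print(len(X_train), len(y_train), len(X_test), len(y_test))  # 354 354 152 152
```

Because X and y are indexed with the same arrays, each training row keeps its matching target, and calling the function again with the same seed returns the same split.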