Python (Stratified)KFold vs. train_test_split: which training data is used?


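I am just a beginner in ML and am trying to understand what exactly the advantage of (Stratified)KFold is over the classic train_test_split.

The classic train_test_split uses exactly one part of the data for training (75% in this case) and one part for testing (25% in this case). Here I know exactly which data points are used for training and which for testing (see the code).

When splitting with (Stratified)KFold we use 4 splits, so we end up with 4 different train/test partitions. It is not clear to me which of these 4 partitions will be used to train and test the logistic regression. Does it make sense to set up the split this way? As far as I understand, the advantage of (Stratified)KFold is that you can use all of the data for training. How do I have to change the code to achieve that?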

Creating the data

import pandas as pd
import numpy as np
target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'col_a':np.random.random(25),
                  'target':target})
df

train_test_split


from sklearn.model_selection import train_test_split

X = df.col_a
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

Output:
TRAIN: Int64Index([1, 13, 8, 9, 21, 12, 10, 4, 20, 19, 7, 5, 15, 22, 24, 17, 11, 23], dtype='int64')
TEST: Int64Index([2, 6, 16, 0, 14, 3, 18], dtype='int64')

StratifiedKFold

from sklearn.model_selection import StratifiedKFold

X = df.col_a
y = df.target

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)

Output: 
TRAIN: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 22 23 24] TEST: [ 0  1  2  3  4 20 21]
TRAIN: [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 23 24] TEST: [ 5  6  7  8  9 22]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 24] TEST: [10 11 12 13 14 23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23] TEST: [15 16 17 18 19 24]

Using logistic regression

from sklearn.linear_model import LogisticRegression

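# A pandas Series has no reshape(); convert to a NumPy array first.
# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
X_train = X_train.to_numpy().reshape(-1, 1)
X_test = X_test.to_numpy().reshape(-1, 1)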

clf = LogisticRegression()

clf.fit(X_train, y_train)
clf.predict(X_test)

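First of all, they both do the same thing; the difference lies in how they do it.

train_test_split:

train_test_split randomly splits the data into a training set and a test set. Apart from the split percentage, no other rule is applied by default.

You get exactly one training set to train on and one test set to evaluate the model on.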
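
As a rough illustration of the single split described above, the sketch below rebuilds data shaped like the question's (25 samples, the last 5 in class 0) and splits it once. The seed values and variable names are assumptions made for this example. It also shows the optional stratify argument of train_test_split, which keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Data shaped like the question's: 25 samples, 20 of class 1 and 5 of class 0.
rng = np.random.default_rng(0)  # seed chosen only for repeatability
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0

# One single split: 75% for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Optional: stratify=y keeps the class ratio similar in train and test.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
n_minority_test = int((ys_test == 0).sum())

print("train size:", len(y_train), "test size:", len(y_test))
print("class-0 samples in stratified test set:", n_minority_test)
```

Whatever ends up in the test part is never seen during training, and only this one partition is ever evaluated.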

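KFold:

The data is split into multiple combinations of training and test data (the folds). The only rule here is the number of combinations. Note that KFold builds the folds from consecutive samples unless shuffle=True is passed.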
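
To make "multiple combinations" concrete, here is a small sketch that uses 25 placeholder samples (an assumption matching the size of the question's data). It counts how often each sample lands in a test part across 4 folds. Every sample is tested exactly once and is used for training in the other 3 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(-1, 1)  # 25 placeholder samples

kf = KFold(n_splits=4)  # default: consecutive folds, no shuffling
times_tested = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    times_tested[test_idx] += 1  # mark the samples tested in this fold
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)    # (train size, test size) for each fold
print(times_tested)  # how often each sample appeared in a test part
```

This is the sense in which k-fold "uses all the data": no single model sees everything, but over the k fits every sample serves both as training and as test data.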

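The problem with splitting the data without regard to the labels is that classes can be misrepresented, i.e. one or more target classes may be represented more strongly than others in a given train/test split. This can bias the model during training.

To prevent this, the training and test splits must have the same class proportions as the target. This can be achieved by using StratifiedKFold.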
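
The difference can be checked on data shaped like the question's (the last 5 of 25 samples are class 0). The sketch below counts the class-0 samples in each test fold for both splitters and then lets cross_val_score run the logistic regression over all 4 stratified folds. The helper name minority_per_test_fold and the seed are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)  # seed chosen only for repeatability
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0  # the 5 minority samples sit at the end, as in the question

def minority_per_test_fold(splitter):
    """Count class-0 samples in the test part of every fold (hypothetical helper)."""
    return [int((y[test_idx] == 0).sum()) for _, test_idx in splitter.split(X, y)]

plain_counts = minority_per_test_fold(KFold(n_splits=4))
strat_counts = minority_per_test_fold(StratifiedKFold(n_splits=4))
print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)

# cross_val_score fits a fresh copy of the model on each fold's training part
# and scores it on that fold's test part, so all 4 splits are used in turn.
scores = cross_val_score(LogisticRegression(), X, y, cv=StratifiedKFold(n_splits=4))
print("per-fold accuracy:", scores, "mean:", scores.mean())
```

With plain unshuffled KFold, all 5 class-0 samples fall into the last test fold, so the model trained for that fold never sees class 0 at all. With StratifiedKFold every fold gets its share. None of the 4 partitions is "the" one used for training: each is used once, and the mean score is what gets reported.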
Side note: if you are trying to use k-fold to get better training, combining StratifiedKFold with GridSearchCV might help.
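
One possible way to combine the two is sketched below, again on data shaped like the question's. The parameter grid for C, the seeds, and the variable names are assumptions picked for illustration, not recommended values. After the search, GridSearchCV refits the best setting on all of the data by default, which gives a final model trained on every sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)  # seed chosen only for repeatability
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0

# Stratified folds keep both classes present in every training part.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Hypothetical grid: three regularisation strengths to compare.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(), param_grid, cv=skf, scoring="accuracy")
search.fit(X, y)  # runs 4-fold CV for each value of C

best_C = search.best_params_["C"]
final_model = search.best_estimator_  # refit on all 25 samples (refit=True)
predictions = final_model.predict(X)

print("best C:", best_C, "mean CV accuracy:", search.best_score_)
```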