How do I split duplicated samples in Pandas for a train-test split with no overlap?

I have an NLP dataset (about 300K samples) that contains duplicates. I want to split it into train and test sets (70%-30%) with no overlap between them.

For example:

| dataset      |   train   |     test   |
|   a          |     a     |       c    |
|   a          |     a     |       c    |
|   b          |     b     |       c    |
|   b          |     b     |            |
|   b          |     b     |            |
|   c          |     d     |            |
|   c          |     d     |            |
|   c          |           |            |
|   d          |           |            |
|   d          |           |            |

I tried random sampling, but it was too time-consuming.

If I'm not mistaken, try the following:

from sklearn.model_selection import GroupShuffleSplit

# one split; rows with the same value share a group, so duplicates stay on the same side
train_inds, test_inds = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state=7).split(df, groups=df['duplicate_column']))

train = df.iloc[train_inds]
test = df.iloc[test_inds]
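
A minimal, self-contained sketch of this approach on the toy data from the question (the column name "text" is an assumption standing in for your real column, and test_size is set to 0.3 to match the 70%-30% split asked for):

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# toy dataset; "text" is a placeholder column name
df = pd.DataFrame({"text": ["a", "a", "b", "b", "b", "c", "c", "c", "d", "d"]})

# rows with the same value share a group, so duplicates never straddle the split
splitter = GroupShuffleSplit(test_size=0.3, n_splits=2, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=df["text"]))

train = df.iloc[train_inds]
test = df.iloc[test_inds]
print(train["text"].tolist())
print(test["text"].tolist())

Note that for GroupShuffleSplit, test_size refers to the proportion of groups, not of rows, so with heavily duplicated data the actual row proportions can drift away from 70/30.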

This works, but it takes a few steps:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# original dataset with duplicates
dataset = pd.DataFrame(["a", "a", "b", "b", "b", "c", "c", "c", "d", "d"])

# get unique values, remove duplicates, but keep original counts
data_no_dup, counts = np.unique(dataset, return_counts=True)

# split using the standard Scikit-Learn way
train_no_dup, test_no_dup = train_test_split(data_no_dup, test_size=0.2, random_state=0)

# retrieve original counts: map each unique value to how many times it
# occurred, then repeat each value accordingly
counts_by_value = dict(zip(data_no_dup, counts))
train, test = [], []
for sample in train_no_dup:
    train.extend([sample] * counts_by_value[sample])
for sample in test_no_dup:
    test.extend([sample] * counts_by_value[sample])

print("Train: {}".format(train))
print("Test: {}".format(test))
Output:

Train: ['d', 'd', 'b', 'b', 'b', 'a', 'a']
Test: ['c', 'c', 'c']

Isn't dropping the duplicates before the split an option?
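
Following that suggestion, here is a minimal sketch of the drop-duplicates-first route using only pandas and train_test_split: drop_duplicates picks the unique values, and an inner merge restores the original counts afterwards (test_size=0.3 here is an assumption matching the 70%-30% split from the question):

import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.DataFrame(["a", "a", "b", "b", "b", "c", "c", "c", "d", "d"])

# split only the unique values, so no value can land on both sides
unique_vals = dataset.drop_duplicates()
train_vals, test_vals = train_test_split(unique_vals, test_size=0.3, random_state=0)

# an inner merge on the shared column brings back every duplicate row
train = dataset.merge(train_vals)
test = dataset.merge(test_vals)

print(train[0].tolist())
print(test[0].tolist())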