Python: shuffle a DataFrame by groups

python pandas dataframe

My DataFrame looks like this:

sampleID  col1 col2
   1        1   63
   1        2   23
   1        3   73
   2        1   20
   2        2   94
   2        3   99
   3        1   73
   3        2   56
   3        3   34

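I need to shuffle the DataFrame while keeping rows with the same sampleID together, and the order of col1 within each sample must stay the same as in the DataFrame above.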

So I need something like this:

sampleID  col1 col2
   2        1   20
   2        2   94
   2        3   99
   3        1   73
   3        2   56
   3        3   34
   1        1   63
   1        2   23
   1        3   73

How can I do this? If my example is unclear, please let me know.

Assuming you want to shuffle by sampleID: first split the frame with df.groupby, shuffle the resulting list of groups (import random first), and then call pd.concat:

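import random

import pandas as pd

# groupby yields (key, group) tuples; keep only the group frames.
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)  # shuffles the list of groups in place

pd.concat(groups).reset_index(drop=True)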

   sampleID  col1  col2
0         2     1    20
1         2     2    94
2         2     3    99
3         1     1    63
4         1     2    23
5         1     3    73
6         3     1    73
7         3     2    56
8         3     3    34

Resetting the index with reset_index(drop=True) is an optional step.
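For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question. The helper name shuffle_groups and its seed parameter are my own illustrative assumptions, not part of the original answer.

```python
import random

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})


def shuffle_groups(frame, key, seed=None):
    """Return a new frame with whole groups of `key` in random order.

    Hypothetical helper: rows inside each group keep their original order.
    """
    rng = random.Random(seed)  # local RNG, so a fixed seed is repeatable
    groups = [group for _, group in frame.groupby(key)]
    rng.shuffle(groups)  # shuffle the list of sub-frames in place
    return pd.concat(groups).reset_index(drop=True)


shuffled = shuffle_groups(df, "sampleID", seed=0)
print(shuffled)
```

Using a local random.Random instance instead of the module-level functions is a design choice to keep the shuffle reproducible without touching global random state.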

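Just adding one thing to @cs95's answer. Suppose you want to shuffle by sampleID but also want the sampleID values in the result to run in order starting from 1, so the original sampleID values do not matter much. Here is a solution that simply iterates over the grouped DataFrames and overwrites the ID (my data uses the column names doc_id, sent_id and word_id in place of sampleID, col1 and col2):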

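import random

import pandas as pd

groups = [df for _, df in df.groupby('doc_id')]

random.shuffle(groups)

# Renumber the shuffled groups 1, 2, 3, ... in their new order.
# assign() returns a new frame, so the sub-frames are not modified in place.
groups = [g.assign(doc_id=i) for i, g in enumerate(groups, start=1)]

shuffled = pd.concat(groups).reset_index(drop=True)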

        doc_id  sent_id  word_id
   0       1        1       20
   1       1        2       94
   2       1        3       99
   3       2        1       63
   4       2        2       23
   5       2        3       73
   6       3        1       73
   7       3        2       56
   8       3        3       34
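The renumbering can also be done after concatenation. The sketch below is an alternative technique, not the answerer's code: it assumes pd.factorize is acceptable here, since it assigns integer codes in order of first appearance. The small frame stands in for an already shuffled result (groups in the order 2, 1, 3).

```python
import pandas as pd

# A frame whose groups are already shuffled (order 2, 1, 3), using the
# answer's column names.
shuffled = pd.DataFrame({
    "doc_id": [2, 2, 2, 1, 1, 1, 3, 3, 3],
    "sent_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "word_id": [20, 94, 99, 63, 23, 73, 73, 56, 34],
})

# factorize() returns (codes, uniques); codes start at 0 and follow the
# order in which each ID first appears in the column.
codes, _ = pd.factorize(shuffled["doc_id"])
shuffled["doc_id"] = codes + 1

print(shuffled)
```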

I found this to be much faster than the accepted answer:

ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()

For some reason, pd.concat was the bottleneck in my use case. Either way, this approach avoids the concatenation.
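A self-contained version of the concat-free idea might look like the sketch below. Note one substitution: it uses NumPy's Generator.permutation rather than random.shuffle, which is my own choice for getting a shuffled copy of the IDs with a fixed seed. The seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
ids = rng.permutation(df["sampleID"].unique())  # shuffled copy of the IDs

# .loc with a list of labels returns the rows for each label in the order
# given, so each sample's rows move together and keep their internal order.
result = df.set_index("sampleID").loc[ids].reset_index()

print(result)
```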

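Shouldn't it be np.random.shuffle(groups)?

@agcala random.shuffle is better suited to a list of objects (here, DataFrames).

What is df in [df for _, df in df.groupby('sampleID')]? Doesn't [df for df in df.groupby('sampleID')] do the same thing?

@AMerii Iterating over a groupby yields (index, group) tuples. Since we don't need the index, we assign it to a throwaway variable ("_") and do nothing with it.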
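To make the last point concrete, here is a small sketch showing what iterating over a groupby actually produces. The four-row frame is a made-up subset of the question's data, used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sampleID": [1, 1, 2, 2],
    "col2": [63, 23, 20, 94],
})

items = list(df.groupby("sampleID"))

# Each item is a (key, sub-frame) tuple, not a DataFrame.
first_key, first_group = items[0]
print(type(items[0]).__name__)
print(first_key)
print(first_group)

# Unpacking each tuple and discarding the key with `_` keeps only the
# sub-frames, which is what pd.concat expects.
groups = [group for _, group in df.groupby("sampleID")]
```

Without the unpacking, the list would hold tuples, and passing it to pd.concat would fail because tuples are not DataFrames.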