Python: the best way to group rows with the same name (python, pandas, dataframe, data-science)

I have a df:

gene  person  allele    allele2
A1      p1       G          C
A2      p1       A          C
A3      p1       A          T
A1      p2       G          C
A2      p2       T          T
A3      p2       G          C
A4      p2       A          T
A2      p1       G          C
A3      p1       C          C
...
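
To experiment with the sample, the nine rows shown above can be rebuilt as a small DataFrame. This is only a sketch of the visible rows; the real table continues past the "..." and its remaining contents are not known here. The variable name df follows the question.

```python
import pandas as pd

# The nine sample rows shown in the question; the real table is much larger.
df = pd.DataFrame({
    "gene":    ["A1", "A2", "A3", "A1", "A2", "A3", "A4", "A2", "A3"],
    "person":  ["p1", "p1", "p1", "p2", "p2", "p2", "p2", "p1", "p1"],
    "allele":  ["G", "A", "A", "G", "T", "G", "A", "G", "C"],
    "allele2": ["C", "C", "T", "C", "T", "C", "T", "C", "C"],
})

# p1 occurs in two separate runs (rows 0-2 and rows 7-8), p2 in one run (rows 3-6).
print(df)
```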

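As you can see, the same person can appear in the table several times (recorded by different labs). The first p1 and the second p1 are different samples, and I need to keep only the one sample with the best score (the highest number of rows). In this example that is the first p1, because it has 3 rows while the other one has 2.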

I don't know how to filter the table to get a result like this:

gene  person  allele    allele2
A1      p1       G          C
A2      p1       A          C
A3      p1       A          T
A1      p2       G          C
A2      p2       T          T
A3      p2       G          C
A4      p2       A          T
...

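I was thinking of indexing it with a for loop: if person equals the person in the row above, assign index i; if not, use i+1. That would give me the groups. But the whole df has 3 million rows, so before starting I decided to describe my problem here. Maybe there is a better way?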
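
The loop idea described above can be sketched as follows, mainly to make clear what "index i" means. It is written in plain Python over a list, and label_runs is a hypothetical helper name, not something from the question. On 3 million rows a Python-level loop like this is likely to be noticeably slower than a vectorized pandas expression.

```python
def label_runs(values):
    """Give each run of equal consecutive values its own integer label."""
    labels = []
    current = 0
    previous = object()  # sentinel that never equals a real value
    for value in values:
        if value != previous:
            current += 1  # the value changed, so a new run starts here
        labels.append(current)
        previous = value
    return labels


persons = ["p1", "p1", "p1", "p2", "p2", "p2", "p2", "p1", "p1"]
print(label_runs(persons))  # [1, 1, 1, 2, 2, 2, 2, 3, 3]
```

The second run of p1 gets its own label (3), which is what allows the two p1 samples to be counted separately.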

Create consecutive groups by comparing each person value with the shifted column (ne plus shift) and taking the cumulative sum, then count the rows of each group with map and value_counts:

g = df['person'].ne(df['person'].shift()).cumsum()
s = g.map(g.value_counts())

Finally, take the maximum of s per person with groupby and transform, and compare it with the Series s:

print (s.groupby(df['person']).transform('max'))
0    3
1    3
2    3
3    4
4    4
5    4
6    4
7    3
8    3
Name: person, dtype: int64

df = df[s.groupby(df['person']).transform('max').eq(s)]
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T

Wow! Great explanation, this is very useful for me. Thank you!

Edit: If several groups of the same person share the maximum size and only the first one is needed (here, for example, p1 has two groups of length 3):

#added last row for another data test
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T
7   A2     p1      G       C
8   A3     p1      C       C
9   A4     p1      C       C

g = df['person'].ne(df['person'].shift()).cumsum()
print (g)
0    1
1    1
2    1
3    2
4    2
5    2
6    2
7    3
8    3
9    3
Name: person, dtype: int32

#same size 3
s = g.map(g.value_counts())
print (s)
0    3
1    3
2    3
3    4
4    4
5    4
6    4
7    3
8    3
9    3
Name: person, dtype: int64

#selected first max index in s
idx = s.groupby(df['person']).idxmax()
print (idx)
person
p1    0
p2    3
Name: person, dtype: int64

#selected groups g
print (g.loc[idx])
0    1
3    2
Name: person, dtype: int32

#selected only matched groups
print (g.isin(g.loc[idx]))
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: person, dtype: bool

df = df[g.isin(g.loc[idx])]
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T
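
Putting the steps of the edit together, a reusable version might look like the sketch below. The function name keep_largest_run and its parameter key are made up for illustration; the calls inside are the same ne, shift, cumsum, map, value_counts, groupby and idxmax used above. It assumes the DataFrame has a unique index (such as the default RangeIndex), since idxmax returns index labels that are then looked up with loc.

```python
import pandas as pd


def keep_largest_run(df, key="person"):
    """Keep, for each value of `key`, only its longest consecutive run of rows.

    If several runs of the same key share the maximum length, the first wins.
    """
    # Label consecutive runs: a new label starts whenever the key changes.
    g = df[key].ne(df[key].shift()).cumsum()
    # Size of the run that each row belongs to.
    s = g.map(g.value_counts())
    # Index label of the first row with the largest run size, per key value.
    idx = s.groupby(df[key]).idxmax()
    # Keep only the rows whose run label belongs to one of the selected runs.
    return df[g.isin(g.loc[idx])]


# Ten-row test data from the edit: p1 has two runs of length 3.
df = pd.DataFrame({
    "gene":    ["A1", "A2", "A3", "A1", "A2", "A3", "A4", "A2", "A3", "A4"],
    "person":  ["p1", "p1", "p1", "p2", "p2", "p2", "p2", "p1", "p1", "p1"],
    "allele":  ["G", "A", "A", "G", "T", "G", "A", "G", "C", "C"],
    "allele2": ["C", "C", "T", "C", "T", "C", "T", "C", "C", "C"],
})

out = keep_largest_run(df)
print(out)
```

Because idxmax returns the first occurrence of the maximum, the first p1 run (rows 0 to 2) is kept and the later run of equal length (rows 7 to 9) is dropped.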