Python pandas: group by and select one row based on a condition

I am looking for a sequence of functions that, given input like the following:

 id   label    rank
aab   quz         2
aaa   foo         1
aac   bar         4
aad   foo         4
aac   foo         2
aac   baz         3
aab   baz         3
aaa   bar         5
will group by id and, within each group, select the record with the lowest rank. The output should look like this:

id   label    rank
aaa  foo         1
aab  quz         2
aac  foo         2
aad  foo         4
Assume the input data is unordered.

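I think you can group by the id column with groupby, then apply idxmin to find the index of the row holding the minimum of the rank column in each group. Then use loc to select those rows: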

print(df.groupby('id')['rank'].idxmin())
id
aaa    1
aab    0
aac    4
aad    3
Name: rank, dtype: int64

print(df.loc[df.groupby('id')['rank'].idxmin(), :])
    id label  rank
1  aaa   foo     1
0  aab   quz     2
4  aac   foo     2
3  aad   foo     4
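
The snippets above assume df already exists. A minimal self-contained sketch, assuming the frame is built directly from the sample rows in the question (the names min_idx and result are only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

# Index label of the lowest-rank row within each id group.
min_idx = df.groupby("id")["rank"].idxmin()

# Select those rows by label; the original index labels are kept.
result = df.loc[min_idx]
print(result)
```

Note that idxmin returns index labels, so this pattern relies on the index being unique. If it is not, calling reset_index(drop=True) first is one way to make it so.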
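Or, without the column slice (the variant timed below):

print(df.loc[df.groupby('id')['rank'].idxmin()])

Timings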

len(df)=8

In [153]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.26 ms per loop

In [154]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
1000 loops, best of 3: 1.67 ms per loop

In [155]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
1000 loops, best of 3: 1.52 ms per loop
len(df)=8k

In [157]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
100 loops, best of 3: 3.55 ms per loop

In [158]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
100 loops, best of 3: 2.24 ms per loop

In [159]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.12 ms per loop
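
The timings above come from IPython's %timeit. A rough way to repeat the comparison in a plain script is sketched below. How the 8k-row frame was built is not stated in the thread, so repeating the sample 1000 times is an assumption, and the helper names by_sort and by_idxmin are made up for this sketch. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

# Assumption: an 8k-row frame made by repeating the sample 1000 times.
# ignore_index=True keeps the index unique, which idxmin + loc relies on.
big = pd.concat([small] * 1000, ignore_index=True)


def by_sort(df):
    # Sort by rank, then take the first row of each id group.
    return df.sort_values("rank").groupby("id").first().reset_index()


def by_idxmin(df):
    # Look up the label of the minimum-rank row per id, then select it.
    return df.loc[df.groupby("id")["rank"].idxmin()]


for name, frame in [("len(df)=8", small), ("len(df)=8k", big)]:
    for func in (by_sort, by_idxmin):
        runs = timeit.repeat(lambda: func(frame), number=20, repeat=3)
        print(f"{name:>10}  {func.__name__:<10} {min(runs) / 20 * 1e3:.2f} ms per loop")
```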

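The simplest approach is probably to sort by rank, group by id, and then select the first element of each group:

> df.sort_values('rank').groupby('id').first().reset_index()  # sort_values replaces DataFrame.sort, which newer pandas versions removed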

#     id label  rank
# 0  aaa   foo     1
# 1  aab   quz     2
# 2  aac   foo     2
# 3  aad   foo     4
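
A self-contained sketch of the sort-then-first approach, assuming the frame is built from the question's sample rows. The kind="mergesort" argument is an addition not in the original answer; it requests a stable sort so that rows with equal rank keep their input order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

result = (
    df.sort_values("rank", kind="mergesort")  # stable sort by rank
      .groupby("id")                           # one group per id
      .first()                                 # first value of each column per group
      .reset_index()                           # turn id back into a column
)
print(result)
```

One design caveat: first() by default takes the first non-null value of each column separately, so with missing values the result may combine values from different rows. If whole rows are needed, groupby('id').head(1) on the sorted frame is a possible alternative.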

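Does pandas guarantee that grouped rows keep their initial sort order within each group?

@DmitryB. That is how I read it.

Thanks for your answer. I ended up accepting the other answer because that is the one I ended up using in my code.

No problem. In case you were wondering, there are two possible solutions to your problem. Happy coding :)

By the way, it is better to use sort_values rather than sort, because sort raises a warning.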
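
On the question raised in the comments about row order within groups, the following sketch checks the behaviour on the sample data. It is an empirical check on one small frame, not proof of a guarantee; the variable names are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

sorted_df = df.sort_values("rank", kind="mergesort")

# Collect the rank values of each group in the order groupby yields them.
ranks_by_id = {key: group["rank"].tolist() for key, group in sorted_df.groupby("id")}

for key, ranks in ranks_by_id.items():
    # Each group's rows should still be in ascending rank order.
    print(key, ranks, ranks == sorted(ranks))
```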