Python: counting the cumulative occurrences of values in a Series


I have a DataFrame that looks like this:

    fruit
0  orange
1  orange
2  orange
3    pear
4  orange
5   apple
6   apple
7    pear
8    pear
9  orange
I want to add a column that counts the cumulative occurrences of each value, i.e.:

    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5
At the moment I am doing it like this:

df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
...this works fine for 10 rows, but it takes a very long time when I try the same thing on a few million rows. Is there a more efficient way to do this?

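You can use groupby and cumcount, adding 1 because cumcount numbers the rows of each group starting from 0.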
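A minimal sketch of that approach, using the same expression that is timed below. The DataFrame construction is an assumption that mirrors the sample data in the question:

```python
import pandas as pd

# Sample data copied from the question (construction is illustrative).
df = pd.DataFrame({
    "fruit": ["orange", "orange", "orange", "pear", "orange",
              "apple", "apple", "pear", "pear", "orange"]
})

# cumcount() numbers the rows within each group 0, 1, 2, ...
# so adding 1 turns it into a running occurrence count.
df["cum_count"] = df.groupby("fruit").cumcount() + 1
print(df)
```

The original row order is preserved, because cumcount returns a Series aligned with the index of df rather than a grouped result.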

Timings

In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop

In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop
So it is about 4 times faster.

It may be better to call cumcount on a specific column, since that is more efficient:

df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1
print(df)

    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5
Comparing with len(df)=10, my solution is the fastest:

In [3]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 299 µs per loop

In [4]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 12.78 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 921 µs per loop

In [5]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.72 ms per loop
Comparing with len(df)=10k:

In [7]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 845 µs per loop

In [8]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 1.59 ms per loop

In [9]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
1 loops, best of 3: 5.12 s per loop
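
To reproduce this kind of comparison outside IPython, something like the following sketch could be used. The helper names (make_frame, cum_count_naive, and so on), the random test data, and the repeat count are all assumptions made for illustration, and the numbers it prints will differ from the ones above. The question's list comprehension is rewritten with enumerate and iloc, because Series.iteritems was removed in pandas 2.0:

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, seed=0):
    """Build a hypothetical test frame with n random fruit names."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"fruit": rng.choice(["orange", "pear", "apple"], size=n)})


def cum_count_groupby(df):
    # groupby on the whole frame, then number rows within each group
    return df.groupby("fruit").cumcount() + 1


def cum_count_groupby_column(df):
    # same idea, but selecting the single column before cumcount
    return df.groupby("fruit")["fruit"].cumcount() + 1


def cum_count_naive(df):
    # quadratic approach from the question: rescan the prefix for every row
    fruit = df["fruit"]
    counts = [(fruit.iloc[: i + 1] == x).sum() for i, x in enumerate(fruit)]
    return pd.Series(counts, index=df.index)


# Sanity check on a small frame: all three approaches should agree.
small = make_frame(200)
assert cum_count_groupby(small).tolist() == cum_count_naive(small).tolist()
assert cum_count_groupby(small).tolist() == cum_count_groupby_column(small).tolist()

# Time only the two groupby variants on 10k rows; the naive version is
# left out here because it takes seconds per call at this size.
big = make_frame(10_000)
for fn in (cum_count_groupby, cum_count_groupby_column):
    seconds = timeit.timeit(lambda: fn(big), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```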

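Sorry, why are you using my solution in your timings? Please add the solution or remove the timings.

@jezrael Yes, sorry. Copy-paste error.

@Li Wen Yip Sorry, I checked your profile, and perhaps you forgot to accept an answer. I think it is important. You can choose either my solution or Anton's and accept one of them. Thanks.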