Python: applying an operation cumulatively across consecutive rows of a DataFrame


I have a pandas DataFrame that looks like this:

sample = pd.DataFrame([[2,3],[4,5],[6,7],[8,9]],
                      index=pd.date_range('2017-08-01','2017-08-04'),
                      columns=['A','B'])

             A   B
2017-08-01   2   3
2017-08-02   4   5
2017-08-03   6   7
2017-08-04   8   9
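I want to multiply the values cumulatively. Taking column A as an example, the second row becomes 2*4, the third row becomes 2*4*6, and the last row becomes 2*4*6*8. The same applies to column B. The desired result is therefore: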

             A    B
2017-08-01   2    3
2017-08-02   8    15
2017-08-03   48   105
2017-08-04   384  945
There must be a built-in way to do this, but because of chained-assignment problems I could not even get it working with a for loop.
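For reference, the loop mentioned above can be written without chained assignment. The following is only a minimal sketch, assuming the `sample` frame from the question; the name `result` is introduced here for illustration. Each row is written with a single `.iloc` assignment instead of chained indexing such as `df['A'][i] = ...`.

```python
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# Work on a copy so the original frame is left untouched.
result = sample.copy()

# One .iloc[row, :] assignment per row writes directly into `result`,
# which avoids the ambiguity of chained indexing.
for i in range(1, len(result)):
    previous = result.iloc[i - 1, :].to_numpy()  # running product so far
    current = sample.iloc[i, :].to_numpy()       # original values of this row
    result.iloc[i, :] = previous * current

print(result)
```

A loop like this is mainly useful for understanding the computation; it does one Python-level step per row.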

Use DataFrame.cumprod.

You can also use np.cumprod on the values:

sample[:] = np.cumprod(sample.values, axis=0)
print(sample)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
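The assignment `sample[:] = ...` above overwrites the original frame. If the original should be kept, the same NumPy call can build a new frame instead. This is a sketch under that assumption; the names `values` and `result` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# axis=0 multiplies down the rows, giving one running product per column.
values = np.cumprod(sample.to_numpy(), axis=0)

# Wrap the array in a new DataFrame that reuses the original labels.
result = pd.DataFrame(values, index=sample.index, columns=sample.columns)

print(result)
```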

Finally (just for fun), use:

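Use DataFrame.cumprod:

print (sample.cumprod())
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945

Alternatively, use numpy.cumprod:

print (np.cumprod(sample))
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945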

Timings:

np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1,2], N, p=(0.99, 0.01)),
                   'B':np.random.choice([1,2], N, p=(0.99, 0.01))})
print (df)

In [31]: %timeit (df.cumprod())
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 150 µs per loop

In [32]: %timeit (np.cumprod(df))
10000 loops, best of 3: 165 µs per loop

In [33]: %timeit (df.apply(np.cumprod))
The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.23 ms per loop
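The timings above come from IPython's `%timeit`. Outside IPython, a rough equivalent can be put together with the standard `timeit` module. This is only a sketch: the `candidates` and `timings` names and the repeat counts are choices made here, and absolute numbers will differ between machines and library versions.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1, 2], N, p=(0.99, 0.01)),
                   'B': np.random.choice([1, 2], N, p=(0.99, 0.01))})

# The three variants compared in the timings above.
candidates = {
    'df.cumprod()': lambda: df.cumprod(),
    'np.cumprod(df)': lambda: np.cumprod(df),
    'df.apply(np.cumprod)': lambda: df.apply(np.cumprod),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 runs of 100 calls each, reported per call.
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[label] = best
    print(f'{label:<22} {best * 1e6:10.1f} µs per call')
```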

DataFrame has a method called cumprod. You can use it as follows:

sample.cumprod()
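As a small illustration of how the call behaves, the sketch below applies `cumprod` to the whole frame, to a single column, and across rows with `axis=1`. The variable names are introduced here for illustration; `sample` is the frame from the question.

```python
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# Default axis=0: running product down each column.
by_column = sample.cumprod()

# Calling it on one column returns a Series.
only_a = sample['A'].cumprod()

# axis=1: running product across each row, left to right (A, then A*B).
by_row = sample.cumprod(axis=1)

print(by_column)
print(only_a)
print(by_row)
```

`cumprod` returns a new object and leaves `sample` unchanged, so the result has to be assigned if it should be kept.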
