Python: calculating the difference between rows with multiple index columns
I have a dataframe in which one column represents time and the other columns represent the other parts of the key.
import pandas as pd

df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
                        for t in range(3)
                        for l1 in [3, 4]
                        for l2 in [10, 100]],
                  columns=['t', 'l1', 'l2', 'x'])
t l1 l2 x
0 0 3 10 0
1 0 3 100 0
2 0 4 10 0
3 0 4 100 0
4 1 3 10 14
5 1 3 100 104
6 1 4 10 15
7 1 4 100 105
8 2 3 10 56
9 2 3 100 416
10 2 4 10 60
11 2 4 100 420
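For each row, I am looking for the difference between its 'x' value and the 'x' value of the row with the previous value of 't' but the same values of 'l1' and 'l2'.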
t l1 l2 x t.1 delta_x
0 0 3 10 0 1 NaN
1 0 3 100 0 1 NaN
2 0 4 10 0 1 NaN
3 0 4 100 0 1 NaN
4 1 3 10 14 2 14.0
5 1 3 100 104 2 104.0
6 1 4 10 15 2 15.0
7 1 4 100 105 2 105.0
8 2 3 10 56 3 42.0
9 2 3 100 416 3 312.0
10 2 4 10 60 3 45.0
11 2 4 100 420 3 315.0
I can generate this frame with the following code:
df['t.1'] = df.t + 1
df['delta_x'] = df.x - df.merge(df, left_on=['t', 'l1', 'l2'],
                                right_on=['t.1', 'l1', 'l2'],
                                how='left',
                                suffixes=['', '.1'])['x.1']
Is there a cleaner or more efficient way to do this?

Try:
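def diff(x):
    # Difference between each value and the previous one in the group.
    return x - x.shift()

# group_keys=False keeps the original row index on newer pandas versions,
# so the result aligns with df when it is assigned as a column.
df['delta_x'] = df.groupby(['l1', 'l2'], group_keys=False)['x'].apply(diff)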
Incorporating MaxU's comment below into my answer.
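You have to use groupby() on the l1 and l2 columns, because you want to compare the x column for each pair of values (l1, l2) as the value in the t column changes.

By default, diff() computes the difference between each row and the previous row within its (l1, l2) group, for example between (t=1) and (t=0), and returns the result. So if you want the difference in x between (t=2) and (t=0), just use diff(periods=2).

Finally, use the transform() method to return the differences computed within each group, aligned with the original rows.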
In [3]: df['delta_x'] = df.groupby(['l1', 'l2'])['x'].transform(lambda x: x.diff())
In [4]: df
Out[4]:
t l1 l2 x delta_x
0 0 3 10 0 NaN
1 0 3 100 0 NaN
2 0 4 10 0 NaN
3 0 4 100 0 NaN
4 1 3 10 14 14.0
5 1 3 100 104 104.0
6 1 4 10 15 15.0
7 1 4 100 105 105.0
8 2 3 10 56 42.0
9 2 3 100 416 312.0
10 2 4 10 60 45.0
11 2 4 100 420 315.0
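As a small illustration of the periods argument mentioned above, the sketch below rebuilds the question's frame and takes the difference over two steps of t within each (l1, l2) group. The column name delta_x_2 is made up for this example, and the code assumes that the rows of each group are already ordered by t.

```python
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
                        for t in range(3)
                        for l1 in [3, 4]
                        for l2 in [10, 100]],
                  columns=['t', 'l1', 'l2', 'x'])

# Difference to the row two positions earlier within each (l1, l2) group.
# For this data that is x(t=2) - x(t=0); rows with t=0 and t=1 get NaN.
# 'delta_x_2' is an illustrative column name, not part of the question.
df['delta_x_2'] = df.groupby(['l1', 'l2'])['x'].diff(periods=2)

print(df[df.t == 2])
```

Because x is 0 for every key at t=0, the two-step difference at t=2 equals the x value itself for this particular frame.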
Timings:
In [5]: %timeit df['delta_x'] = df.groupby(['l1', 'l2'])['x'].transform(lambda x: x.diff())
1000 loops, best of 3: 1.55 ms per loop
In [17]: %timeit df['delta_x'] = df.x - df.merge(df, left_on=['t', 'l1', 'l2'], right_on=['t.1', 'l1', 'l2'],how='left',suffixes=['','.1'])['x.1']
100 loops, best of 3: 3.33 ms per loop
My first thought was
df.groupby(['l1', 'l2'])['x'].diff()
Interestingly, Nickil's method appears to be faster:
import pandas as pd
import timeit
df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
                        for t in range(3)
                        for l1 in [3, 4]
                        for l2 in [10, 100]],
                  columns=['t', 'l1', 'l2', 'x'])
N = 1000
t = timeit.timeit("df.groupby(['l1', 'l2'])['x'].transform(lambda x: x.diff())", setup='from __main__ import df', number=N)
print(t) # 1.8100
def diff(x):
    return x - x.shift()
t = timeit.timeit("df.groupby(['l1', 'l2'])['x'].apply(diff)", setup='from __main__ import df, diff', number=N)
print(t) # 2.6829
t = timeit.timeit("df.groupby(['l1', 'l2'])['x'].diff()", setup='from __main__ import df', number=N)
print(t) # 2.5710
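Before comparing speed it can help to confirm that the three variants timed above return the same values. The sketch below is one way to check that; the variable names (grouped, via_transform, via_apply, via_diff, all_equal) are illustrative, and group_keys=False is passed as an assumption so that apply() keeps the original row index on newer pandas versions. On a 12-row frame the timings mostly measure fixed overhead, so the ranking may differ on larger data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
                        for t in range(3)
                        for l1 in [3, 4]
                        for l2 in [10, 100]],
                  columns=['t', 'l1', 'l2', 'x'])


def diff(x):
    # Hand-written equivalent of Series.diff() with periods=1.
    return x - x.shift()


grouped = df.groupby(['l1', 'l2'], group_keys=False)['x']

# sort_index() puts every result back in the original row order
# before the values are compared.
via_transform = grouped.transform(lambda x: x.diff()).sort_index()
via_apply = grouped.apply(diff).sort_index()
via_diff = grouped.diff().sort_index()

# equal_nan=True treats the NaN of each group's first row as equal.
all_equal = (
    np.allclose(via_transform.to_numpy(), via_diff.to_numpy(), equal_nan=True)
    and np.allclose(via_apply.to_numpy(), via_diff.to_numpy(), equal_nan=True)
)
print(all_equal)
```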
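An approach that may be simpler, and easier for future developers to follow and modify, is to split the dataframe by the values of (l1, l2), do the series calculation on each piece, and then merge the pieces back together. If you do not need your own diff() function and the slow apply() method, just use the standard df.groupby(['l1', 'l2'])['x'].diff().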
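The split, compute, and recombine idea described above might look like the following sketch. It is only one possible reading of that description: the names pieces, part, and result are illustrative, and sorting each piece by t is an added assumption so that the difference is always taken against the previous time step.

```python
import pandas as pd

df = pd.DataFrame(data=[(t, l1, l2, t * t * (1 + l2 + l1))
                        for t in range(3)
                        for l1 in [3, 4]
                        for l2 in [10, 100]],
                  columns=['t', 'l1', 'l2', 'x'])

pieces = []
# Split: one sub-frame per (l1, l2) key.
for (l1, l2), part in df.groupby(['l1', 'l2']):
    # Compute: order the piece by time, then take a plain Series diff.
    part = part.sort_values('t').copy()
    part['delta_x'] = part['x'].diff()
    pieces.append(part)

# Recombine: stack the pieces and restore the original row order.
result = pd.concat(pieces).sort_index()
print(result)
```

The explicit loop is more verbose than groupby(...).diff(), but each step can be inspected or changed on its own.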