Python: set the previous period's data as a new DataFrame column
I have a DataFrame:
import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                   ], columns=['Group', 'Start Date', 'End Date', 'Value'])
For reference, this is the complete code for the merge-based answer explained further down. It uses shorter column names (Start, End) and a slightly different sample (row 3 has Value 3.0 instead of 13.0):

import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 3.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                   ], columns=['Group', 'Start', 'End', 'Value'])

# Parse the date strings into timestamps
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])

# Length of each period, and the start date of the period that would precede it
df['duration'] = df['End'] - df['Start']
df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)

# Self-merge: a row's (Group, duration, Prev) must equal another row's (Group, duration, Start)
result = pd.merge(df, df[['Group', 'duration', 'Start', 'Value']], how='left',
                  left_on=['Group', 'duration', 'Prev'],
                  right_on=['Group', 'duration', 'Start'], suffixes=('', '_y'))
result = result[['Group', 'Start', 'End', 'Value', 'Value_y']]
result = result.rename(columns={'Value_y': 'Prev Value'})
print(result)
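Printing the question's original DataFrame (the first listing, with the 'Start Date' and 'End Date' columns) gives: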
   Group  Start Date    End Date  Value
0      A  2014-01-01  2014-01-07    1.2
1      B  2014-01-01  2014-01-07    2.5
2      C  2014-01-01  2014-01-07    3.0
3      A  2014-01-08  2014-01-14   13.0
4      B  2014-01-08  2014-01-14    2.0
5      C  2014-01-08  2014-01-14    1.0
6      A  2014-01-15  2014-01-21   10.0
7      A  2014-01-21  2014-01-27   98.0
8      B  2014-01-21  2014-01-27   -5.0
9      C  2014-01-21  2014-01-27  -72.0
10     A  2014-01-22  2014-01-28    8.0
11     B  2014-01-22  2014-01-28   25.0
12     C  2014-01-22  2014-01-28  -23.0
13     A  2014-01-22  2014-02-22    8.0
14     B  2014-01-22  2014-02-22   25.0
15     C  2014-01-22  2014-02-22  -23.0
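I am trying to add a new column that contains the value of the same group from the previous period, if one exists. The output should therefore look like this: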
   Group  Start Date    End Date  Value  Last Period Value
0      A  2014-01-01  2014-01-07    1.2                NaN
1      B  2014-01-01  2014-01-07    2.5                NaN
2      C  2014-01-01  2014-01-07    3.0                NaN
3      A  2014-01-08  2014-01-14   13.0                1.2
4      B  2014-01-08  2014-01-14    2.0                2.5
5      C  2014-01-08  2014-01-14    1.0                3.0
6      A  2014-01-15  2014-01-21   10.0               13.0
7      A  2014-01-21  2014-01-27   98.0                NaN
8      B  2014-01-21  2014-01-27   -5.0                NaN
9      C  2014-01-21  2014-01-27  -72.0                NaN
10     A  2014-01-22  2014-01-28    8.0               10.0
11     B  2014-01-22  2014-01-28   25.0                NaN
12     C  2014-01-22  2014-01-28  -23.0                NaN
13     A  2014-01-22  2014-02-22    8.0                NaN
14     B  2014-01-22  2014-02-22   25.0                NaN
15     C  2014-01-22  2014-02-22  -23.0                NaN
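Note that the rows with NaN have no corresponding row for the same group in the previous period. So a row that spans 7 days (one week) needs to be matched with the row of the same group from the week before.

The simplest approach (although it has quadratic complexity) is the following: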
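import datetime as dt

# Store the parsed dates as real columns (attribute assignment would not create columns)
df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

def find_previous_period(row):
    # The previous period is the same window shifted back by 7 days
    prev_sd = row.sd - dt.timedelta(days=7)
    prev_ed = row.ed - dt.timedelta(days=7)
    prev_period = df[(df.sd == prev_sd) & (df.ed == prev_ed) & (df.Group == row.Group)]
    if not prev_period.empty:
        # irow() was removed from pandas; iloc is the positional accessor
        return prev_period.iloc[0].Value
    return float('nan')

df['Last Period Value'] = df.apply(find_previous_period, axis=1)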
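If you have a lot of data, you may need a more elegant solution.

Update for the requirement that both periods span the same number of days (from the comments):

If I understand your definition of a "period" correctly, this works and should be fast: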
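df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])
# Period length counting both endpoints (end - start + 1 days, as requested in the comments)
df['td'] = df['ed'] - df['sd'] + dt.timedelta(days=1)
# Bounds of the immediately preceding period of the same length
df['sd2'] = df['sd'] - df['td']
df['ed2'] = df['ed'] - df['td']
df2 = pd.merge(df, df[['sd', 'ed', 'Value', 'Group']],
               left_on=['sd2', 'Group', 'ed2'],
               right_on=['sd', 'Group', 'ed'], how='left', suffixes=('', '_prev'))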
You will have to clean up the column names and drop the extra columns.

Suppose we compute, for each row, the duration between Start and End:
df['duration'] = df['End']-df['Start']
Suppose we also use that duration to compute the start date that the previous period would have:
df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)
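As a quick check of this arithmetic, here is a minimal sketch for a single row. The dates are those of row 3 in the example data; the variable names are chosen only for illustration. A period from 2014-01-08 to 2014-01-14 has a duration of 6 days, so a preceding period of the same length has to start 7 days earlier:

```python
import pandas as pd

start = pd.Timestamp('2014-01-08')
end = pd.Timestamp('2014-01-14')

duration = end - start                          # Timedelta of 6 days
prev = start - duration - pd.Timedelta(days=1)  # start of the preceding period

print(duration.days)  # 6
print(prev.date())    # 2014-01-01
```

The result, 2014-01-01, is the Start of row 0, which is the row that row 3 should be matched with.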
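Then we can express the desired DataFrame as the result of merging df with itself, matching rows whose Group, duration and Prev (in one DataFrame) equal the Group, duration and Start (in the other). The complete listing given earlier (the block that ends with print(result)) does exactly this and yields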
   Group      Start        End  Value  Prev Value
0      A 2014-01-01 2014-01-07    1.2         NaN
1      B 2014-01-01 2014-01-07    2.5         NaN
2      C 2014-01-01 2014-01-07    3.0         NaN
3      A 2014-01-08 2014-01-14    3.0         1.2
4      B 2014-01-08 2014-01-14    2.0         2.5
5      C 2014-01-08 2014-01-14    1.0         3.0
6      A 2014-01-15 2014-01-21   10.0         3.0
7      A 2014-01-21 2014-01-27   98.0         NaN
8      B 2014-01-21 2014-01-27   -5.0         NaN
9      C 2014-01-21 2014-01-27  -72.0         NaN
10     A 2014-01-22 2014-01-28    8.0        10.0
11     B 2014-01-22 2014-01-28   25.0         NaN
12     C 2014-01-22 2014-01-28  -23.0         NaN
13     A 2014-01-22 2014-02-22    8.0         NaN
14     B 2014-01-22 2014-02-22   25.0         NaN
15     C 2014-01-22 2014-02-22  -23.0         NaN
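In the comments, Artur Nowak asked about the time complexity of pd.merge. I believe it performs an O(N + M) hash join, where N is the size of the hash table and M is the size of the lookup table. Here is some code that tests the performance of pd.merge as a function of the DataFrame size: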
import collections
import string
import timeit

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

timing = collections.defaultdict(list)

def make_df(ngroups, ndur, ndates):
    # One row per (Start, duration, Group) combination: ngroups * ndur * ndates rows
    groups = list(string.ascii_uppercase[:ngroups])  # string.uppercase exists only in Python 2
    durations = range(ndur)
    start = pd.date_range('2000-1-1', periods=ndates, freq='D')
    index = pd.MultiIndex.from_product([start, durations, groups],
                                       names=['Start', 'duration', 'Group'])
    values = np.arange(len(index))
    df = pd.DataFrame({'Value': values}, index=index).reset_index()
    df['End'] = df['Start'] + pd.to_timedelta(df['duration'], unit='D')
    df = df.drop('duration', axis=1)
    df = df[['Group', 'Start', 'End', 'Value']]
    # Same helper columns as in the answer above
    df['duration'] = df['End'] - df['Start']
    df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)
    return df

def using_merge(df):
    result = pd.merge(df, df[['Group', 'duration', 'Start', 'Value']], how='left',
                      left_on=['Group', 'duration', 'Prev'],
                      right_on=['Group', 'duration', 'Start'], suffixes=('', '_y'))
    return result

# Number of start dates per run; make_df(10, 10, n) has 100 * n rows
Ns = np.array([10**i for i in range(5)])
for n in Ns:
    timing['merge'].append(timeit.timeit(
        'using_merge(df)',
        'from __main__ import using_merge, make_df; df = make_df(10, 10, {})'.format(n),
        number=5))

print(timing['merge'])
# Fit a straight line to the timings to see how close the scaling is to linear
slope, intercept, rval, pval, stderr = stats.linregress(Ns, timing['merge'])
print(slope, intercept, rval, pval, stderr)

plt.plot(Ns, timing['merge'], label='merge')
plt.plot(Ns, slope*Ns + intercept)
plt.legend(loc='best')
plt.show()
This suggests that, for DataFrames with tens of thousands of rows, the running time of pd.merge is roughly linear.
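To illustrate why a hash join would scale as O(N + M), here is a minimal pure-Python sketch. It is not how pandas implements pd.merge internally; the function name hash_left_join and the (key, value) tuple layout are assumptions made for this example only. Building the dictionary takes one pass over the lookup rows, and probing it takes one pass over the other rows:

```python
def hash_left_join(rows, lookup_rows):
    """Left-join rows to lookup_rows on a hashable key.

    rows:        list of (key, value) pairs used to probe the table
    lookup_rows: list of (key, value) pairs used to build the table
    Returns a list of (key, value, matched_value_or_None).
    """
    # Build phase: one pass over lookup_rows, average O(1) per insertion
    table = {}
    for key, value in lookup_rows:
        table.setdefault(key, value)  # keep the first value seen for each key

    # Probe phase: one pass over rows, average O(1) per lookup
    return [(key, value, table.get(key)) for key, value in rows]


# Keys play the role of (Group, Prev) on the probing side
# and (Group, Start) on the lookup side.
lookup = [(('A', '2014-01-01'), 1.2), (('B', '2014-01-01'), 2.5)]
probe = [(('A', '2014-01-01'), 13.0), (('C', '2014-01-01'), 1.0)]

joined = hash_left_join(probe, lookup)
print(joined)
```

Rows without a partner (group C here) receive None, which corresponds to the NaN entries in the merged DataFrame.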
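How do you define the "previous period"? Are periods equivalent to calendar weeks, or can they be arbitrary? If they are always exactly one week long, converting the period start date to a week number might help.

Periods can be variable (defined by a number of days). So row index #3 spans 7 days, and the last 7-day period before it (for the same group) is row index #0. The group must be the same, the number of days must be the same, and the two periods must be contiguous (the start date of the current period is the day after the end date of the previous one).

When using week numbers, do they keep increasing, or do they restart at 1 on January 1? Also, the length of each period is variable, so I am not sure that would work.

In fact I have a lot of data, so I am hoping to find a solution that is more elegant and faster than n-squared performance.

This is close. The number of days in the two periods (current and previous) must also be the same. That is in addition to the two periods being contiguous.

The timedelta for each row should be (end - start + 1) days. Otherwise the periods would overlap instead of being contiguous (i.e. adjacent weeks).

Out of curiosity, do you know the computational complexity of the merge operation? I could not find this information, and I wonder how it compares with splitting the data by group and duration (defined as in your answer), sorting by period start, and then scanning the data sequentially.

@ArturNowak: I believe pd.merge performs a hash join. As a practical matter, I think it is always necessary to benchmark both versions on data close to the real use case to find out which is faster (for that use case). I have added some timeit code to study the performance of pd.merge as a function of DataFrame size. If you add code that does the split/sort/sequential processing, we can run some empirical tests.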
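The split/sort/sequential alternative raised in the comments could be sketched as follows. This is only an illustration and not code from the thread: the function name add_prev_value and the small sample are hypothetical, and the sketch assumes there is at most one row per (Group, duration, Start), so that the row directly above within a sorted group is the only candidate for the previous period.

```python
import pandas as pd

# Small hypothetical sample using the answer's column names
df = pd.DataFrame(
    [['A', '2014-01-01', '2014-01-07', 1.2],
     ['A', '2014-01-08', '2014-01-14', 13.0],
     ['A', '2014-01-15', '2014-01-21', 10.0],
     ['A', '2014-01-22', '2014-01-28', 8.0],
     ['B', '2014-01-01', '2014-01-07', 2.5],
     ['B', '2014-01-22', '2014-01-28', 25.0]],
    columns=['Group', 'Start', 'End', 'Value'])
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])


def add_prev_value(df):
    out = df.copy()
    out['duration'] = out['End'] - out['Start']
    # Sort so that, within each (Group, duration), rows are ordered by Start
    out = out.sort_values(['Group', 'duration', 'Start'])
    grouped = out.groupby(['Group', 'duration'])
    # Look at the row directly above within the same group
    prev_end = grouped['End'].shift(1)
    prev_value = grouped['Value'].shift(1)
    # Keep the value only if the periods are contiguous:
    # the current Start is the day after the previous End
    contiguous = (out['Start'] - prev_end) == pd.Timedelta(days=1)
    out['Prev Value'] = prev_value.where(contiguous)
    # Restore the original row order and drop the helper column
    return out.sort_index().drop(columns='duration')


result = add_prev_value(df)
print(result)
```

The sort dominates the cost at O(n log n), so whether this beats the merge in practice would have to be measured on realistic data, as suggested above.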