Sum of values for the same ID within a time window - Python
I have the following DataFrame, which is the result of a groupby operation:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    ['ID1','2019-09-06',1],
    ['ID1','2019-09-11',1],
    ['ID1','2019-09-25',2],
    ['ID1','2019-09-27',1],
    ['ID1','2019-10-21',1],
    ['ID2','2019-10-15',1],
    ['ID2','2019-10-17',3],
    ['ID2','2019-10-19',2],
    ['ID2','2019-11-09',1],
]), columns = ["id", "date", "value"])
I want the sum of the value column for the same ID within a time window that ends at each row's date. The expected output for a 7d window is:
expected = pd.DataFrame(np.array([
    ['ID1','2019-09-06',1,1],
    ['ID1','2019-09-11',1,2],
    ['ID1','2019-09-25',2,2],
    ['ID1','2019-09-27',1,3],
    ['ID1','2019-10-21',1,1],
    ['ID2','2019-10-15',1,1],
    ['ID2','2019-10-17',3,4],
    ['ID2','2019-10-19',2,6],
    ['ID2','2019-11-09',1,1],
]), columns = ["id", "date", "value", "sum of values in 7d"])
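I already have code that works for this case, but it is not the cleanest solution. It is also really slow when you have thousands of rows, which is my case.
My function is: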
def countPeriod(i, d, df, p = 7):
    # Slices a subset with only the corresponding ID
    aux = df[df["id"] == i]
    # Gets the period: the p days ending at d (inclusive)
    period = pd.date_range(end = d, periods = p)
    # Sums "value" for each date in period
    s = 0
    for day in period:
        # "date" holds datetime.date objects, so compare against day.date()
        aux_date = aux[aux["date"] == day.date()]
        s = s + aux_date["value"].sum()
    return s
When I call it as follows, it returns the expected values:
result = df.copy()
result["date"] = pd.to_datetime(result["date"], dayfirst = True, format = "%Y-%m-%d").dt.date
result["value"] = result["value"].astype(int)
result["sum of values in 7d"] = result.apply(lambda x: countPeriod(x["id"], x["date"], result, 7), axis = 1)
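I can't think of another way to do this. I also looked elsewhere, but what I found does not seem to fit my problem either.
Is there a cleaner, faster way? I have to do this operation many times on the data I am working with.

I only considered basic pandas features.
1. Use Grouper to create a DF that also contains records for the dates missing from the data
2. Combine the original DF with the created DF
3. Add a column for the 7-day period
4. Compute the 7-day check inside a loop
5. Concatenate to an empty DF
I'm not sure about the processing speed.

The nice example posted by r-beginners made me think of another way to solve this problem: a single function that runs over the whole dataset instead of using .apply, with .rolling as an easier way to get the result.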
def countPeriod3(df, p = 7):
    # Assumes df["date"] is datetime64 and df["value"] is numeric
    # Collects one processed slice per ID
    slices = []
    for i in df['id'].unique():
        # Slices a subset with only the corresponding ID
        df_slice = df[df["id"] == i]
        # Sets date as index to use asfreq
        aux = df_slice.set_index('date')
        # Uses a daily frequency to fill the gaps; missing days count as 0
        aux = aux.asfreq('d')
        aux['value'] = aux['value'].fillna(0)
        # Rolling window to sum p days
        aux['sum7d'] = aux['value'].rolling(p, min_periods=1).sum()
        # Keeps only the rolling sum and puts date back as a column
        aux = aux[['sum7d']].reset_index()
        # Gets only the lines that appear on the slice
        slices.append(df_slice.merge(aux, on='date', how='left'))
    # Puts all slices together
    return pd.concat(slices, ignore_index = True)
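
As an alternative to filling the daily gaps with asfreq, pandas also accepts a time-based rolling window such as "7D" on a datetime index. The sketch below is not from the thread: the helper name rolling_sum_by_id is hypothetical, and it assumes a reasonably recent pandas version, a datetime64 "date" column and a numeric "value" column. With the default right-closed window, "7D" covers the row's date and the six days before it.

```python
import pandas as pd

# Same sample data as in the question, built with proper dtypes
df = pd.DataFrame({
    "id": ["ID1"] * 5 + ["ID2"] * 4,
    "date": pd.to_datetime([
        "2019-09-06", "2019-09-11", "2019-09-25", "2019-09-27", "2019-10-21",
        "2019-10-15", "2019-10-17", "2019-10-19", "2019-11-09",
    ]),
    "value": [1, 1, 2, 1, 1, 1, 3, 2, 1],
})

def rolling_sum_by_id(df, window="7D"):
    # Hypothetical helper: sort so that dates increase within each ID
    out = df.sort_values(["id", "date"]).reset_index(drop=True)
    rolled = (
        out.set_index("date")
           .groupby("id")["value"]
           .rolling(window)   # time-based window ending at each row's date
           .sum()
    )
    # rolled is ordered by (id, date), like out, so assign by position
    out["sum7d"] = rolled.to_numpy()
    return out

result = rolling_sum_by_id(df)
print(result)
```

This avoids the explicit loop over IDs and the merge back onto each slice, at the cost of requiring the data to be sorted by date within each ID.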
In fact, I compared the running time of all the code with timeit and got the following results:
First code = 43.2 ms
r-beginners' code = 26.6 ms
The code above = 11.7 ms
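
A timing comparison of this kind could be set up with timeit roughly as follows. This is only a sketch: candidate is a stand-in for whichever implementation is being measured, best_ms is a hypothetical helper, and the small frame is illustrative. Absolute numbers depend on the machine and the pandas version, so none are shown here.

```python
import timeit
import pandas as pd

# Small illustrative frame with proper dtypes
df = pd.DataFrame({
    "id": ["ID1", "ID1", "ID2", "ID2"],
    "date": pd.to_datetime(["2019-09-06", "2019-09-11", "2019-10-15", "2019-10-17"]),
    "value": [1, 1, 1, 3],
})

def candidate(frame):
    # Stand-in for the implementation under test
    return frame.set_index("date").groupby("id")["value"].rolling("7D").sum()

def best_ms(func, *args, repeat=5, number=20):
    # Best per-call time in milliseconds over several repeats
    times = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(times) / number * 1000

elapsed = best_ms(candidate, df)
print(f"candidate: {elapsed:.2f} ms per call")
```

Taking the minimum over several repeats reduces the influence of unrelated system load on the measurement.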
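Thanks again to r-beginners for the help. Your code works very well and is faster than my previous option: it ran in 26.6 ms, versus 43.2 ms for mine. Your contribution also made me think of a third approach that is even faster. I'll post it below in case you are interested.
I would very much like to know how fast it runs. Did you manage to cut the time by 40%? I'm glad I could help.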