Python: aggregating data by time interval
Tags: python, pandas, datetime, intervals

I have a problem that I can't solve in Python. I have solved it before in SQL, but I'm not as good at Python as I am at SQL. Here is a sample of my data:
desc date_1 date_2 date_3 values
54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87
In short, I'm trying to recreate the following functionality:
SUM(CASE
WHEN date_1 <= date_2 - interval '11' month
AND date_3 > date_2 - interval '11' month
THEN values
end)
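For reference, one way the SQL expression above might be written in pandas is with a boolean mask. This is only a sketch: the three rows are copied from the sample data, `pd.DateOffset(months=11)` is assumed to be an acceptable stand-in for `interval '11' month`, and the names `cutoff`, `mask`, `conditional_sum` and `per_desc` are made up for illustration.

```python
import pandas as pd

# Three rows copied from the sample data (ids 54287171, 55933497, 46160680).
df = pd.DataFrame(
    {
        "desc": ["cc-cc", "dd-cc", "dd-cc"],
        "date_1": pd.to_datetime(["2018-03-14", "2017-04-14", "2014-06-05"]),
        "date_2": pd.to_datetime(["2017-07-03", "2018-06-07", "2018-01-16"]),
        "date_3": pd.to_datetime(["2018-05-21", "2017-09-27", "2016-01-27"]),
        "values": [55, 36, 87],
    }
)

# date_2 - interval '11' month, using calendar-aware month arithmetic.
cutoff = df["date_2"] - pd.DateOffset(months=11)

# WHEN date_1 <= cutoff AND date_3 > cutoff
# (element-wise &, with each comparison in its own parentheses)
mask = (df["date_1"] <= cutoff) & (df["date_3"] > cutoff)

# SUM(CASE WHEN ... THEN values END) over the whole table.
conditional_sum = df.loc[mask, "values"].sum()

# The same conditional sum per group, like a GROUP BY desc.
per_desc = df["values"].where(mask, 0).groupby(df["desc"]).sum()

print(conditional_sum)
print(per_desc)
```

One difference from SQL to keep in mind: when no row matches, SUM over a CASE without ELSE returns NULL, whereas the pandas sum returns 0.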
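I hope I understood your question correctly.

Yes, groupby groups by the attributes of one or more columns. You can group by date_2 and/or by desc and/or by any other column you like.

You can define conditions, save them in the dataframe, and then group by those conditions as well. In your case, the condition checks whether date_1 is at least 11 months earlier than date_2. The trickiest part is the 11-month time difference. A simple way to implement it is numpy.timedelta64(11, 'M').

One potential problem is that the timedelta is resolved into a generic time distance and does not keep the notion of calendar months. This can be an issue because months are not all the same length. If you only care about months, consider storing just the number of months since some reference time.

A script with an example: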
""" Create an example dataset """
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]
"""Question 1: Yes, groupby() groups by properties for one or more columns"""
# Sum only the numeric column; summing the date columns is not meaningful.
df.groupby(["desc"])["values"].sum()
# values
#desc
#cc-cc 224
#cc-dd 58
#dd-cc 382
#dd-dd 27
#other-cc 153
#other-dd 26
"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""
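# np.timedelta64(11, 'M') is converted to a fixed duration (11 average months),
# not 11 calendar months. Recent pandas versions may reject the 'M' unit here.
df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))
# Sum only the numeric column for each combination of desc and conditions.
df.groupby(["desc", "condition_1", "condition_2"])["values"].sum()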
#
#desc condition_1 condition_2
#cc-cc False False 143
# True False 81
#cc-dd False False 58
#dd-cc False False 127
# True False 255
#dd-dd False False 27
#other-cc False False 62
# True 54
# True False 37
#other-dd False False 9
# True False 17
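OK, using the answer below and with some help from colleagues, I tried a few different options, and this is the most concise solution we came up with: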
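import numpy as np
from dateutil.relativedelta import relativedelta

# One column per month offset i in -12..11: keep `values` where both date
# conditions hold for that offset, otherwise 0.
for i in np.arange(-12, 12, 1):
    shift = relativedelta(months=int(i))
    df['Month_' + str(i)] = df.apply(
        lambda x: x['values']
        if (x['date_2'] <= x['date_1'] + shift) and (x['date_3'] > x['date_2'] + shift)
        else 0,
        axis=1)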
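The loop above calls a Python lambda for every row and every offset. A possible vectorised variant is sketched here, assuming the date columns have datetime64 dtype and that `pd.DateOffset` may replace `relativedelta`; the two rows are copied from the sample data, and `month_cols` and `per_desc` are names invented for the example.

```python
import pandas as pd

# Two rows copied from the sample data (ids 54287171 and 7181109).
df = pd.DataFrame(
    {
        "desc": ["cc-cc", "dd-dd"],
        "date_1": pd.to_datetime(["2018-03-14", "2018-01-18"]),
        "date_2": pd.to_datetime(["2017-07-03", "2017-11-24"]),
        "date_3": pd.to_datetime(["2018-05-21", "2018-01-15"]),
        "values": [55, 27],
    }
)

month_cols = []
for i in range(-12, 12):
    offset = pd.DateOffset(months=i)
    # Same two comparisons as in the row-wise lambda, evaluated column-wise.
    cond = (df["date_2"] <= df["date_1"] + offset) & (df["date_3"] > df["date_2"] + offset)
    col = f"Month_{i}"
    # Keep `values` where the condition holds, 0 elsewhere.
    df[col] = df["values"].where(cond, 0)
    month_cols.append(col)

# Final aggregation step: sum every month column per group.
per_desc = df.groupby("desc")[month_cols].sum()
print(per_desc)
```

On a frame of this size the difference is negligible; the column-wise form mainly pays off when there are many rows.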
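The last part is a simple groupby on two fields, aggregated by sum, i.e. df.groupby(['field_1', 'field_2']).sum()

Thanks, I'll try it now. Would this work if I wanted to use multiple conditions? df['condition_1'] = df['date_2'] >= df['date_1'] + pd.Timedelta(np.timedelta64(11, 'M')) and df['date_3'] > df['date_1'] + pd.Timedelta(np.timedelta64(11, 'M'))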
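On the question of multiple conditions: with pandas Series the comparisons are normally combined with the element-wise `&` operator, each comparison in its own parentheses, because Python's `and` raises a ValueError when applied to a whole Series. A minimal sketch follows, with rows copied from the sample data and `pd.DateOffset(months=11)` assumed in place of the timedelta expression; `threshold` and `grouped` are illustrative names.

```python
import pandas as pd

# Three rows copied from the sample data (ids 55933497, 46160680, 35894666).
df = pd.DataFrame(
    {
        "desc": ["dd-cc", "dd-cc", "cc-cc"],
        "date_1": pd.to_datetime(["2017-04-14", "2014-06-05", "2017-12-04"]),
        "date_2": pd.to_datetime(["2018-06-07", "2018-01-16", "2018-06-13"]),
        "date_3": pd.to_datetime(["2017-09-27", "2016-01-27", "2018-08-14"]),
        "values": [36, 87, 88],
    }
)

threshold = df["date_1"] + pd.DateOffset(months=11)

# Both comparisons must hold for a row; note the parentheses around each one.
df["condition_1"] = (df["date_2"] >= threshold) & (df["date_3"] > threshold)

# Group by the description and the combined condition, then sum the values.
grouped = df.groupby(["desc", "condition_1"])["values"].sum()
print(grouped)
```

Use `|` for "or" and `~` for "not" in the same way.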