Python: sum based on a date range across two columns

python python-3.x pandas

I want to sum all the values in one column based on the date range defined by two other columns:

Start_Date  Value_to_sum  End_date
2017-12-13    2          2017-12-13
2017-12-13    3          2017-12-16 
2017-12-14    4          2017-12-15
2017-12-15    2          2017-12-15

A simple groupby won't do this, because it only adds up the values for one specific date.
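
To make the problem concrete, here is a small sketch that rebuilds the sample table and runs a plain groupby on it. The name `df` and the conversion with `pd.to_datetime` are assumptions made for illustration; the question does not show how the data is loaded.

```python
import pandas as pd

# Sample data from the question; the frame name `df` is an assumption.
df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# A plain groupby only adds up rows that share the same Start_Date;
# it ignores rows whose range merely covers that date.
per_start = df.groupby("Start_Date")["Value_to_sum"].sum()
print(per_start)
```

This gives 5, 4 and 2 for the three start dates, whereas the range-based totals asked for are 5, 7 and 9.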

We could use embedded (nested) for loops, but that takes forever to run:

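import pandas as pd
from tqdm import tqdm

# `data` is the original DataFrame (Start_Date, Value_to_sum, End_date)
unique_date = data.Start_Date.unique()
carry = pd.DataFrame({'Date': unique_date})
carry['total'] = 0
for n in tqdm(range(len(carry))):
    # rows whose range started on or before this date
    tr = data.loc[data['Start_Date'] <= carry['Date'][n]]
    for i in tr.index:
        # ...and whose range has not ended before this date
        if carry['Date'][n] <= tr['End_date'][i]:
            carry.loc[n, 'total'] += tr['Value_to_sum'][i]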

How can I compute the sum based on the date range?

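Unfortunately, I don't think there is a way to do this without at least one loop. You are trying to check whether a date falls between the start date and the end date. If it does, you want to add up the Value_to_sum column. We can make your loop more efficient.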

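You can create a mask for each unique date that finds all the rows matching the condition, then apply the mask and take the sum over the matching rows. This should be much faster than iterating over every row individually and working out which date counters to increment.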

unique_date = df.Start_Date.unique()
for d in unique_date:
    # create a mask which will give us all the rows 
    # that we want to sum over
    # then apply the mask and take the sum of the Value_to_sum column
    m = (df.Start_Date <= d) & (df.End_date >= d)
    print(d, df[m].Value_to_sum.sum())

Someone else may come up with a clever way to vectorize the whole thing, but I don't see a way to do it.
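
For what it's worth, one possible loop-free sketch (not part of the original answer) is to treat each row as "add the value on Start_Date, remove it the day after End_date" and take a running sum. It assumes both date columns are datetimes at day granularity and that End_date is inclusive; the frame name `df` and the helper names `starts`, `ends`, `running` and `totals` are made up for illustration.

```python
import pandas as pd

# Sample data from the question; all names below are illustrative.
df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# Each value becomes active on its start date...
starts = df.groupby("Start_Date")["Value_to_sum"].sum()
# ...and stops being active the day after its (inclusive) end date.
ends = df.groupby(df["End_date"] + pd.Timedelta(days=1))["Value_to_sum"].sum()

# Net change per day, then a running sum gives the active total on each day
# where something changes.
running = starts.sub(ends, fill_value=0).sort_index().cumsum().astype(int)

# Keep only the dates that appear as a start date, as in the loop version.
totals = running[running.index.isin(df["Start_Date"])]
print(totals)
```

On the sample data this gives 5, 7 and 9 for 2017-12-13 through 2017-12-15. Because the running sum is only defined on days where a range starts or ends, this sketch is suited to looking up dates that occur as a start date; other dates would need the last known value carried forward.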

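If you want the sum to be part of the original DataFrame, you can use apply to iterate over each row (though this may not be the most optimized code, since you compute the sum for every row).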

First, group by ["Start_Date", "End_date"] to save some operations:

from collections import Counter
import pandas as pd

c = Counter()
# collapse rows that share the same (Start_Date, End_date) pair
df_g = df.groupby(["Start_Date", "End_date"]).sum().reset_index()

def my_counter(row):
    s, v, e = row.Start_Date, row.Value_to_sum, row.End_date
    if s == e:
        # single-day range: add the value to that day only
        c[pd.Timestamp(s)] += v
    else:
        # multi-day range: add the value to every day from s to e inclusive
        c.update({date: v for date in pd.date_range(s, e)})

df_g.apply(my_counter, axis=1)
print(c)
"""
Output (the exact Timestamp repr depends on the pandas version):
Counter({Timestamp('2017-12-15 00:00:00'): 9,
     Timestamp('2017-12-14 00:00:00'): 7,
     Timestamp('2017-12-13 00:00:00'): 5,
     Timestamp('2017-12-16 00:00:00'): 3})
"""

Tools used:

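Counter.update([iterable-or-mapping]): elements are counted from an iterable or added in from another mapping (or counter). Like dict.update(), but it adds counts instead of replacing them. Also, the iterable is expected to be a sequence of elements, not a sequence of (key, value) pairs. (Quoted from the Python documentation.)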
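
A small sketch of the difference described above, using plain strings as keys for readability (the answer itself uses Timestamps):

```python
from collections import Counter

# Counter.update adds to existing counts...
c = Counter({"2017-12-13": 2})
c.update({"2017-12-13": 3, "2017-12-14": 3})

# ...whereas dict.update overwrites the existing value.
d = {"2017-12-13": 2}
d.update({"2017-12-13": 3, "2017-12-14": 3})

print(c)
print(d)
```

After the update, the counter holds 5 for 2017-12-13, while the plain dict holds 3. That additive behaviour is what lets overlapping date ranges accumulate.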

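Sorry, my first language is French.
You know you can edit my question for language-related issues :) I have already done a bit. This change seems pretty drastic, and I want to make sure you didn't mean "recursive".
What is your expected output?
No problem, sorry, my mistake. I have added my expected output.
Great answer, but since my data is large I need this to run fast, and I doubt I can use your solution :)
How big are we talking? How many unique dates do you have? Also, are your date columns just strings, or Python datetimes?
About 4000 dates, but there are several processes before and after. This solution is already a lot faster than mine, but having no loop at all would be best.
DateTime. Do you think strings would be faster?
Great answer, very clean.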
Expected output:

2017-12-13    5
2017-12-14    7
2017-12-15    9

carry['total'] = carry.apply(
    lambda current_row: carry.loc[
        (carry['Start_Date'] <= current_row.Start_Date)
        & (carry['End_date'] >= current_row.Start_Date)
    ].Value_to_sum.sum(),
    axis=1,
)

>>> print(carry)
     End_date  Start_Date  Value_to_sum  total
0  2017-12-13  2017-12-13             2      5
1  2017-12-16  2017-12-13             3      5
2  2017-12-15  2017-12-14             4      7
3  2017-12-15  2017-12-15             2      9
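
Since the apply version recomputes the same sum for every row that shares a start date, one possible refinement is to compute the masked sum once per unique date and then map it back onto the rows. This is only a sketch under the same assumptions as above (datetime columns, inclusive End_date); the dictionary name `totals` is made up for illustration.

```python
import pandas as pd

# Sample data from the question, using the frame name `carry` as in the apply version.
carry = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# One masked sum per unique start date, instead of one per row.
totals = {
    d: carry.loc[
        (carry["Start_Date"] <= d) & (carry["End_date"] >= d), "Value_to_sum"
    ].sum()
    for d in carry["Start_Date"].drop_duplicates()
}

# Look up each row's start date in the precomputed totals.
carry["total"] = carry["Start_Date"].map(totals)
print(carry)
```

The resulting total column matches the apply version (5, 5, 7, 9), but the number of masked sums is the number of unique dates rather than the number of rows.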
