Python: adding date columns for the months between two dates in a DataFrame

python pandas algorithm data-structures

I have an existing DataFrame that looks like this:

    id  start_date  end_date
0   1   20170601    20210531
1   2   20181001    20220930
2   3   20150101    20190228
3   4   20171101    20211031


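I am trying to add 85 columns to this DataFrame, one per month, defined as follows:

if the month/year (iterating from start_date to end_date) falls between 20120101 and 20190101: 1
otherwise: 0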

I tried the following approach:

from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

# Every distinct "mm/yy" label in the overall window, in chronological order.
start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    # Month labels covered by this contract, found by walking it one day at a time.
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())
    # 1 if the contract covers the month, 0 otherwise.
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

# Row-wise apply: one Python-level call per contract.
sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)

The sample df looks like this:


   customer_id  contract_start_date  contract_end_date  01/12  02/12  03/12  04/12  05/12  06/12  07/12  ...  04/18  05/18  06/18  07/18  08/18  09/18  10/18  11/18  12/18  01/19
1            1             20181001           20220930      0      0      0      0      0      0      0  ...      0      0      0      0      0      0      1      1      1      1
9            2             20160701           20200731      0      0      0      0      0      0      0  ...      1      1      1      1      1      1      1      1      1      1
3            3             20171101           20211031      0      0      0      0      0      0      0  ...      1      1      1      1      1      1      1      1      1      1
3 rows × 88 columns

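This works fine for a small dataset, but for 160k rows it had not finished even after 3 hours. Can someone suggest a better way to do this?

I am also running into problems when the dates overlap for the same customer.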

First, I'd cut off the invalid dates to normalize the end time (to ensure it falls within the time range).
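The code for this clipping step is not shown here. As a minimal sketch of one way it might look, assuming the dates are stored as `YYYYMMDD` strings and the window runs from January 2012 through January 2019 (the names `window_start` and `window_end` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample mirroring the question's first table.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_date": ["20170601", "20181001", "20150101", "20171101"],
    "end_date": ["20210531", "20220930", "20190228", "20211031"],
})

# Parse the YYYYMMDD strings into real timestamps.
df["start_date"] = pd.to_datetime(df["start_date"], format="%Y%m%d")
df["end_date"] = pd.to_datetime(df["end_date"], format="%Y%m%d")

# Assumed window bounds, taken from the dates used in the question.
window_start = pd.Timestamp("2012-01-01")
window_end = pd.Timestamp("2019-01-31")

# Keep dates that are already inside the window; replace the rest with the bound.
df["start_date"] = df["start_date"].where(df["start_date"] >= window_start, window_start)
df["end_date"] = df["end_date"].where(df["end_date"] <= window_end, window_end)
```

With the sample rows above, every `end_date` lies past the window, so all four are pulled back to 2019-01-31, while the start dates are left unchanged.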

If multiple rows start or end in the same month, you'll need a groupby sum.

# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)

In [18]: res
Out[18]:
   2012-01-01  2012-02-01  2012-03-01  2012-04-01  2012-05-01     ...      2018-09-01  2018-10-01  2018-11-01  2018-12-01  2019-01-01
0         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
1         0.0         0.0         0.0         0.0         0.0     ...             0.0         1.0         1.0         1.0         1.0
2         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
3         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
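The construction of `res` before the forward fill is not shown above. The following is a rough, self-contained sketch of how such a marker frame might be built and then filled. It is not necessarily the exact code that produced the output above. The names `months`, `start_idx`, `stop_idx` and `markers` are assumptions for illustration: a `1` marks the month a contract starts, and a `-1` marks the first month after it ends.

```python
import numpy as np
import pandas as pd

# Hypothetical input mirroring the question's first table.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_date": pd.to_datetime(["20170601", "20181001", "20150101", "20171101"], format="%Y%m%d"),
    "end_date": pd.to_datetime(["20210531", "20220930", "20190228", "20211031"], format="%Y%m%d"),
})

# One column per month start, January 2012 through January 2019 (85 columns).
months = pd.date_range("2012-01-01", "2019-01-01", freq="MS")
n = len(months)
first = months[0]

# Column position of the start month, and of the first month after the end month.
start_idx = ((df["start_date"].dt.year - first.year) * 12
             + df["start_date"].dt.month - first.month).to_numpy()
stop_idx = ((df["end_date"].dt.year - first.year) * 12
            + df["end_date"].dt.month - first.month + 1).to_numpy()
start_idx = np.clip(start_idx, 0, n)
stop_idx = np.clip(stop_idx, 0, n)

# Place the markers; positions equal to n fall outside the window and are skipped.
markers = np.zeros((len(df), n))
rows = np.arange(len(df))
has_start = start_idx < n
markers[rows[has_start], start_idx[has_start]] = 1
has_stop = stop_idx < n
markers[rows[has_stop], stop_idx[has_stop]] = -1

# Forward-fill the markers along each row, then turn the placeholders back into zeros.
res = pd.DataFrame(markers, index=df.index, columns=months)
res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)
```

Everything here is array-level work, with no Python call per row, which is why this kind of approach tends to scale much better than a row-wise `apply`.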

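What exactly are you trying to do? This isn't the end goal, is it?

@AndyHayden, this is the end goal. Do you think it is possible?

OK, it seems a bit unbelievable, but fine.

Also, my end goal is basically to use it as time series data after merging with other DFs.

It doesn't work properly if the dates overlap for the same customer. I added an example to the question.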
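For the overlapping-dates case, one possible variation (a sketch under assumptions, not something posted in this thread) is to record `+1` at each contract's start month and `-1` at the month after its end, sum those markers per customer with a groupby, and take a running total along the columns. A month is then flagged whenever at least one contract is active. The sample data and the helper `month_offset` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: customer 1 has two overlapping contracts.
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "contract_start_date": pd.to_datetime(["2016-07-01", "2017-01-01", "2018-10-01"]),
    "contract_end_date": pd.to_datetime(["2017-06-30", "2018-03-31", "2022-09-30"]),
})

months = pd.date_range("2012-01-01", "2019-01-01", freq="MS")
n = len(months)
first = months[0]

def month_offset(dates):
    """Number of months between the first column and each date (hypothetical helper)."""
    return ((dates.dt.year - first.year) * 12 + dates.dt.month - first.month).to_numpy()

start_idx = np.clip(month_offset(df["contract_start_date"]), 0, n)
stop_idx = np.clip(month_offset(df["contract_end_date"]) + 1, 0, n)

# The extra trailing column absorbs markers that fall beyond the window.
delta = np.zeros((len(df), n + 1), dtype=np.int64)
rows = np.arange(len(df))
delta[rows, start_idx] += 1
delta[rows, stop_idx] -= 1

# Sum markers per customer, then accumulate: a positive running total means
# at least one contract is active in that month.
per_contract = pd.DataFrame(delta[:, :n], index=df.index, columns=months)
per_customer = per_contract.groupby(df["customer_id"]).sum()
flags = (per_customer.cumsum(axis=1) > 0).astype(int)
```

Here customer 1 ends up with a single unbroken run of ones from July 2016 through March 2018, rather than two separate runs or a count of 2 in the overlapping months.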