Python pandas: how to split rows into periods based on start and end dates


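I want to explode a DataFrame based on its start and end dates. Each resulting row should cover a 10-day interval, or run to the last day of the month. For example, given my input DataFrame df, my output should look like this: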

id,start_date,end_date,points,number_of_days
1,2020-01-01,2020-01-10,100,10
1,2020-01-11,2020-01-20,100,10
2,2020-01-11,2020-01-20,200,10
2,2020-01-21,2020-01-31,200,11
2,2020-02-01,2020-02-10,200,10
3,2020-04-21,2020-04-30,300,10
3,2020-05-01,2020-05-10,300,10
4,2020-02-21,2020-02-29,400,9
4,2020-03-01,2020-03-10,400,10

This can be fairly tricky. First, I assume your data is indexed by the id column. If it isn't, you can do that easily with:

import pandas as pd

df.set_index("id", inplace = True)

Also, make sure the date columns are of datetime type:

df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
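
The input table from the question is not shown here, so the following is only a minimal sketch of the setup. The frame below is a hypothetical input inferred from the expected output (one row per id, from its first start date to its last end date); the values are an assumption, not the asker's actual data.

```python
import pandas as pd

# Hypothetical input, reconstructed from the expected output in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_date": ["2020-01-01", "2020-01-11", "2020-04-21", "2020-02-21"],
    "end_date": ["2020-01-20", "2020-02-10", "2020-05-10", "2020-03-10"],
    "points": [100, 200, 300, 400],
})

# Index by id and convert the date strings to real datetimes.
df = df.set_index("id")
for col in ("start_date", "end_date"):
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```

If the data comes from an Excel file, the date columns may already be datetimes after reading; calling pd.to_datetime again is harmless in that case.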

One last thing to consider: this is the helper I used to add days to a date:

def add_days(date, days):
    return date + pd.DateOffset(days=days)
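
As a quick illustration of how this helper behaves around a month boundary, here is a small sketch. The sample date is just an example I picked; Timestamp.daysinmonth is the attribute the splitting function relies on to find the last day of the month.

```python
import pandas as pd

def add_days(date, days):
    return date + pd.DateOffset(days=days)

d = pd.Timestamp("2020-02-21")

# 2020 is a leap year, so February has 29 days.
print(d.daysinmonth)                        # 29

# Adding 9 days crosses into March.
print(add_days(d, 9))                       # 2020-03-01 00:00:00

# Adding (days in month - day of month) lands on the last day of the month.
print(add_days(d, d.daysinmonth - d.day))   # 2020-02-29 00:00:00
```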

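Now, let's start. In my opinion, the hardest part is building a function that splits a date interval. Keep in mind that this function receives one row of the original DataFrame as its argument and has to return several rows. I will return a list: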

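def split_dates(row):
    """Split one row's [start_date, end_date] range into pieces of about 10 days."""
    start = row["start_date"]
    end = row["end_date"]
    points = row["points"]
    new_rows = []

    curr_date = start
    # <= so that a final single-day piece (curr_date == end) is not dropped
    while curr_date <= end:

        # Try 10 days ahead first: this only stays if it lands exactly on
        # the last day of the month (e.g. the 21st to the 31st, 11 days)
        delta_days = 10
        curr_date_aux = add_days(curr_date, delta_days)

        # Otherwise use a regular 10-day piece (start of piece + 9 days)
        if curr_date_aux.day != curr_date.daysinmonth:
            delta_days = 9
            curr_date_aux = add_days(curr_date, delta_days)

        # Never run past the end date: measure from curr_date, not from start
        if curr_date_aux > end:
            delta_days = (end - curr_date).days
            curr_date_aux = add_days(curr_date, delta_days)

        # Never cross a month boundary: stop at the last day of the month
        if curr_date_aux.month != curr_date.month:
            delta_days = curr_date.daysinmonth - curr_date.day
            curr_date_aux = add_days(curr_date, delta_days)

        new_rows.append([curr_date, curr_date_aux, points, delta_days + 1])
        curr_date = add_days(curr_date_aux, 1)

    return new_rows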
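
To see why returning a list of rows is enough, here is a toy sketch of the mechanism, separate from the date logic. A Series whose values are lists of lists can be exploded so that each inner list gets its own entry, with the original index label repeated. The names s, exploded and out, and the sample values, are made up for illustration.

```python
import pandas as pd

# Each value is a list of "rows"; index 10 yields two rows, index 20 yields one.
s = pd.Series([[["a", 1], ["b", 2]], [["c", 3]]], index=[10, 20])

# explode() unpacks only the outer list, so each element is now one inner list.
exploded = s.explode()
print(exploded.index.tolist())   # [10, 10, 20]

# The inner lists can then be turned into columns, keeping the repeated index.
out = pd.DataFrame(exploded.tolist(), index=exploded.index, columns=["letter", "n"])
print(out)
```

This is the same pattern used with the per-row lists of date pieces: apply produces one list per original row, and explode spreads those lists over several rows that share the original id.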

And (finally), apply the function to every row, explode the resulting lists, and build the desired DataFrame:

# One list of pieces per original row; explode() gives each piece its own row
new_dates = df.apply(split_dates, axis = 1).explode()

new_df = pd.DataFrame(new_dates.tolist(),
    index = new_dates.index,
    columns = ["start_date", "end_date", "points", "number of days"])

#   start_date   end_date  points  number of days
#id                                              
#1  2020-01-01 2020-01-10     100              10
#1  2020-01-11 2020-01-20     100              10
#2  2020-01-11 2020-01-20     200              10
#2  2020-01-21 2020-01-31     200              11
#2  2020-02-01 2020-02-10     200              10
#3  2020-04-21 2020-04-30     300              10
#3  2020-05-01 2020-05-10     300              10
#4  2020-02-21 2020-02-29     400               9
#4  2020-03-01 2020-03-10     400              10

Can you clearly define the logic used to explode the database? For example, why does the expected output contain each of its rows?

I am reading the records from an Excel file, not from any database. I don't necessarily have to use the explode function; I just need the expected output. The expected output should contain start and end dates at 10-day intervals, or, for the last period of a month, 8, 9, 10 or 11 days, depending on the month and year.

I still don't understand how you turn the input into the expected output. What is the explicit logic for each output row?

I have not converted it to the desired output. I want the logic to convert it into the desired output.

@BasantJain Please let me know if my answer was useful.
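
Regarding the question about the explicit logic: one possible reading of the rule described in the comments is that every month is cut into three fixed buckets (days 1 to 10, 11 to 20, and 21 to the end of the month). The sketch below implements that reading. It is not the function from the answer and the names period_end and split_by_bucket are hypothetical; it matches the expected output for the sample rows, but it behaves differently when a start date does not fall on the 1st, 11th or 21st.

```python
import pandas as pd

def period_end(date):
    """Last day of the calendar bucket (1-10, 11-20, 21-month end) containing date."""
    if date.day <= 10:
        return date.replace(day=10)
    if date.day <= 20:
        return date.replace(day=20)
    return date.replace(day=date.daysinmonth)

def split_by_bucket(start, end):
    """Return (piece_start, piece_end, number_of_days) tuples covering [start, end]."""
    pieces = []
    cur = start
    while cur <= end:
        # Stop at the bucket boundary or at the overall end date, whichever is first.
        stop = min(period_end(cur), end)
        pieces.append((cur, stop, (stop - cur).days + 1))
        cur = stop + pd.Timedelta(days=1)
    return pieces

for s, e, n in split_by_bucket(pd.Timestamp("2020-02-21"), pd.Timestamp("2020-03-10")):
    print(s.date(), e.date(), n)
# 2020-02-21 2020-02-29 9
# 2020-03-01 2020-03-10 10
```

With this formulation the last bucket of a month comes out as 8, 9, 10 or 11 days long, which is the range of lengths mentioned in the comments.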