How to quickly build a date MultiIndex in pandas

Tags: python, pandas, dataframe, multi-index

I have a pandas DataFrame, df. Here are the first five rows:

    Id  StartDate   EndDate
0   0   2015-08-11  2018-07-13
1   1   2014-02-15  2016-01-25
2   2   2014-12-20  NaT
3   3   2015-01-09  2015-01-14
4   4   2014-07-20  NaT

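I want to construct a new DataFrame, df2. For each Id in df, df2 should have one row for every month between StartDate and EndDate, inclusive. For example, since the first row of df has a StartDate in August 2015 and an EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we treat its end date as June 2019.

I want df2 to use a MultiIndex whose first level is the corresponding Id from df, whose second level is the year, and whose third level is the month. For example, if the five rows above were all of df, then df2 should look like this: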

Id  Year    Month
0   2015    8
            9
            10
            11
            12
    2016    1
            2
            3
            4
            5
            6
            7
            8
            9
            10
            11
            12
    2017    1
            2
            3
            4
            5
            6
            7
            8
            9
            10
            11
            12
    2018    1
... ... ...
4   2017    1
            2
            3
            4
            5
            6
            7
            8
            9
            10
            11
            12
    2018    1
            2
            3
            4
            5
            6
            7
            8
            9
            10
            11
            12
    2019    1
            2
            3
            4
            5
            6
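To make the target concrete, here is a minimal sketch of how the months for a single row could be enumerated. It relies on pd.period_range, which includes both endpoints when start and end are given. The helper name months_between and its default_end parameter are hypothetical and not part of the original code; the June 2019 default simply mirrors the rule stated above.

```python
import pandas as pd

def months_between(start, end, default_end="2019-06"):
    """Return (year, month) pairs from start's month to end's month, inclusive.

    Hypothetical helper for illustration only; a missing end (NaT)
    falls back to default_end.
    """
    if pd.isna(end):
        end = pd.Timestamp(default_end)
    periods = pd.period_range(
        start=pd.Timestamp(start), end=pd.Timestamp(end), freq="M"
    )
    return [(p.year, p.month) for p in periods]

# Id 3 starts and ends in the same month, so it yields a single pair.
print(months_between("2015-01-09", "2015-01-14"))

# Id 0 runs from August 2015 through July 2018.
pairs = months_between("2015-08-11", "2018-07-13")
print(pairs[0], pairs[-1], len(pairs))
```

Calling this once per row would still be a Python-level loop, so it is meant to pin down the expected result rather than to be fast.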

The code below does this, but it takes about 20 seconds to run for 10k Ids on my laptop. Can I make it more efficient?

import numpy as np
import pandas as pd

def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns 2d array to be converted to multiindex.
    # Each row of returned array represents a month/year
    # between enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[],[],[]]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)    
    return np.array(multiindex_array)


# Begin by constructing array for first id.
array_for_multiindex = build_multiindex_for_id_(0,8,2015,7,2018)

# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)

df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id','Year','Month'])

pd.DataFrame(index=df2_index)

After some trial and error, here is my approach:

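# Long format: one row per (Id, StartDate/EndDate), with the date in 'value'.
(df.melt(id_vars='Id')
   # A missing EndDate becomes 2019-06-01.
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   # Reindex each Id to month-end dates between its first and last date
   # (pandas 2.2+ prefers the alias 'ME' over 'M').
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year','month'], append=True)
)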

Output:

                   value  Id variable
Id year month                        
0  2015 8     2015-08-31 NaN      NaN
        9     2015-09-30 NaN      NaN
        10    2015-10-31 NaN      NaN
        11    2015-11-30 NaN      NaN
        12    2015-12-31 NaN      NaN
   2016 1     2016-01-31 NaN      NaN
        2     2016-02-29 NaN      NaN
        3     2016-03-31 NaN      NaN
        4     2016-04-30 NaN      NaN
        5     2016-05-31 NaN      NaN
        6     2016-06-30 NaN      NaN
        7     2016-07-31 NaN      NaN
        8     2016-08-31 NaN      NaN
        9     2016-09-30 NaN      NaN
        10    2016-10-31 NaN      NaN

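Comments:

What happens for Id 2 and Id 4, which have no end date?

I like your answer, but it is actually slower than what I proposed. Mine takes 20 seconds for 10k rows, while yours takes 28 seconds. You did make me aware of the asfreq method, which is very useful. Thanks.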
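Since both approaches above do per-Id work in Python, one possible alternative is to build the index with array operations only. The sketch below is an illustration rather than a benchmarked solution: it encodes each month as the integer year * 12 + (month - 1), repeats each Id once per month in its range, and decodes the integers back into year and month. The function name build_month_index and its default_end parameter are hypothetical. It assumes StartDate and EndDate are datetime64 columns and that no EndDate is earlier than its StartDate. The sample frame reproduces the five rows from the question.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4],
    "StartDate": pd.to_datetime(
        ["2015-08-11", "2014-02-15", "2014-12-20", "2015-01-09", "2014-07-20"]),
    "EndDate": pd.to_datetime(
        ["2018-07-13", "2016-01-25", None, "2015-01-14", None]),
})

def build_month_index(df, default_end="2019-06-30"):
    """Hypothetical helper: MultiIndex (Id, Year, Month) without a Python loop."""
    end = df["EndDate"].fillna(pd.Timestamp(default_end))

    # Encode each month as one integer: year * 12 + (month - 1).
    first = (df["StartDate"].dt.year * 12 + df["StartDate"].dt.month - 1).to_numpy()
    last = (end.dt.year * 12 + end.dt.month - 1).to_numpy()
    counts = last - first + 1  # months per Id, both endpoints included

    ids = np.repeat(df["Id"].to_numpy(), counts)

    # Position of each output row within its Id's run: 0, 1, ..., count - 1.
    run_starts = np.repeat(np.cumsum(counts) - counts, counts)
    offsets = np.arange(counts.sum()) - run_starts
    months = np.repeat(first, counts) + offsets

    # Decode the integers back into year and 1-based month.
    return pd.MultiIndex.from_arrays(
        [ids, months // 12, months % 12 + 1],
        names=["Id", "Year", "Month"],
    )

df2 = pd.DataFrame(index=build_month_index(df))
print(df2.index[:3])
print(len(df2))
```

The integer encoding avoids date arithmetic entirely, so the cost is a handful of NumPy calls regardless of the number of Ids. Whether it beats the 20-second loop on the real 10k-row data would need to be measured.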