Pandas: how do I merge two dataframes on offset dates?


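I want to merge two dataframes, df1 and df2, based on whether a df2 row falls within a date range of 3-6 months after a df1 row. For example:

df1 (I have quarterly data for each company):

df2 (for each company, I have event dates that can fall on any day):

    company EventDate
0   012345  2005-07-28

This is actually one of those rare questions where the algorithmic complexity of different solutions can differ significantly. You may want to weigh that against the neatness of one-line snippets.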

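Algorithmically:

  • Sort the larger dataframe by date.

  • For each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe.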
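The two steps above can be sketched with plain Python lists before bringing pandas into it. This is a minimal illustration, not the answerer's code: the helper name `events_in_window` is hypothetical, the event dates are the ones listed for company 123456 in df2, and the window is treated as exclusive at both ends (an assumption that matches the `bisect_right`/`bisect_left` pairing used in the solution further down).

```python
import bisect
from datetime import date

# Event dates for one company (taken from the df2 listing), sorted once up front.
events = sorted([
    date(2005, 5, 30), date(2005, 5, 15), date(2005, 8, 8),
    date(2005, 5, 17), date(2005, 11, 29), date(2005, 5, 25),
])

def events_in_window(sorted_events, start, end):
    """Return the events with start < event < end using two binary searches."""
    lo = bisect.bisect_right(sorted_events, start)  # first index with event > start
    hi = bisect.bisect_left(sorted_events, end)     # first index with event >= end
    return sorted_events[lo:hi]

# Window for a quarter ending 2005-01-31: 3 to 6 month-ends later.
window = events_in_window(events, date(2005, 4, 30), date(2005, 7, 31))
print(window)
```

Sorting happens once, and each lookup afterwards costs two binary searches plus the size of the slice it returns, which is where the complexity advantage over scanning every row comes from.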


For dataframes of lengths m and n respectively (m …

Here is my solution, following the algorithm Ami Tavory suggested:

import bisect
import pandas as pd

# find the date offsets that define each date range
# (assumes DATADATE and EventDate are datetime64 columns)
df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)

# find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = df2.company.unique()

# sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)

# collect the merged pieces here and concatenate them once at the end
pieces = []

# iterate through each company in df1, find
# that company in sorted df2, then for each
# DATADATE quarter of df1, bisect df2 in the
# correct locations (i.e. start_time to end_time)

for cmpny in unique_companies_df1:

    # if this company is in both dfs, take the rows associated with it
    if cmpny in unique_companies_df2:
        selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        event_dates = selected_df2.EventDate.tolist()

        # iterate through each DATADATE quarter in df1
        for quarter in range(len(selected_df1)):
            # bisect_right so that dates up to and including start_time are excluded
            lo = bisect.bisect_right(event_dates, selected_df1.start_time[quarter])
            # bisect_left so that dates from end_time onward are excluded
            hi = bisect.bisect_left(event_dates, selected_df1.end_time[quarter])
            # positional slice: .loc[lo:hi] would also include row hi, which lies outside the range
            df_right = selected_df2.iloc[lo:hi].copy()
            df_left = selected_df1.loc[[quarter]]

            # if no EventDates fall within range, merge against a single row
            # holding cmpny in 'company' and NaT in 'EventDate'
            if len(df_right) == 0:
                df_right = pd.DataFrame({'company': [cmpny], 'EventDate': [pd.NaT]})

            # merge the df1 company quarter with all df2 rows that fell within the date range
            pieces.append(pd.merge(df_left, df_right, how='inner', on='company'))

df3 = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()
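To see which window the `MonthEnd` offsets in the code above produce, it can help to evaluate them on a single date. This small check is only an illustration; the two input dates are chosen for the example (the first is one of the DATADATE values that appears in the output listing, the second is a made-up mid-month date).

```python
import pandas as pd

# A quarter-end date: MonthEnd(n) moves it n month-ends forward.
quarter_end = pd.Timestamp("2005-01-31")
start = quarter_end + pd.offsets.MonthEnd(3)
end = quarter_end + pd.offsets.MonthEnd(6)
print(start.date(), end.date())

# A date that is not a month end: rolling forward to the end of its own
# month already counts as the first step, so the window comes out shorter.
mid_month = pd.Timestamp("2005-01-15") + pd.offsets.MonthEnd(3)
print(mid_month.date())
```

If DATADATE were ever not a month end, `pd.DateOffset(months=3)` would be the offset that adds calendar months instead; whether that is wanted depends on how the quarterly dates are recorded.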

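I implemented it following the steps you provided and posted my code above. It works, although it takes a long time on my large dataset. I had originally hoped to use a pandas groupby on df1, grouping by ['company', 'DATADATE'] and calling groupby.apply() with a function that fetches the relevant rows of df2, i.e. those whose EventDate falls between the start_time and end_time of each df1 row (3-6 months after DATADATE).

That's interesting. When I have time, I'd really like to look at your answer in depth.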
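A grouped variant of the same idea might look like the sketch below. It is not the commenter's code and it does not use `groupby.apply()`: it iterates over `df1.groupby('company')` and does both binary searches for a whole company at once with `numpy.searchsorted`. The function name `merge_events_in_window` is hypothetical, the df2 rows are the ones from the df2 listing, and the df1 rows are an assumption reconstructed from the DATADATE values in the output listing, since the original df1 table is not shown.

```python
import numpy as np
import pandas as pd

def merge_events_in_window(df1, df2):
    """For each df1 row, attach df2 events with start_time < EventDate < end_time."""
    df1 = df1.copy()
    df1["start_time"] = df1["DATADATE"] + pd.offsets.MonthEnd(3)
    df1["end_time"] = df1["DATADATE"] + pd.offsets.MonthEnd(6)
    # Sorted event dates per company, built once.
    events = {c: g["EventDate"].sort_values().reset_index(drop=True)
              for c, g in df2.groupby("company")}

    rows = []
    for company, group in df1.groupby("company"):
        if company not in events:
            continue  # companies missing from df2 are skipped, as in the loop version
        dates = events[company]
        dates_np = dates.to_numpy()
        lo = np.searchsorted(dates_np, group["start_time"].to_numpy(), side="right")
        hi = np.searchsorted(dates_np, group["end_time"].to_numpy(), side="left")
        for datadate, i, j in zip(group["DATADATE"], lo, hi):
            matched = dates.iloc[i:j].tolist() or [pd.NaT]  # NaT when nothing matches
            rows.extend((company, datadate, d) for d in matched)
    return pd.DataFrame(rows, columns=["company", "DATADATE", "EventDate"])

# df1 is assumed (reconstructed from the output listing); df2 is from the df2 listing.
df1 = pd.DataFrame({
    "company": ["012345"] * 4 + ["123456"] * 4,
    "DATADATE": pd.to_datetime([
        "2005-06-30", "2005-09-30", "2005-12-31", "2006-03-31",
        "2005-01-31", "2005-03-31", "2005-06-30", "2005-09-30"]),
})
df2 = pd.DataFrame({
    "company": ["012345"] * 2 + ["123456"] * 6 + ["abcxyz"],
    "EventDate": pd.to_datetime([
        "2005-07-28", "2005-10-12", "2005-05-15", "2005-05-17", "2005-05-25",
        "2005-05-30", "2005-08-08", "2005-11-29", "2005-12-31"]),
})
result = merge_events_in_window(df1, df2)
print(result)
```

The per-row Python loop is still there for building the output, so this mainly saves the repeated filtering, sorting and small merges of the loop version; how much that helps on a large dataset would need to be measured.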
df2:

    company EventDate
0   012345  2005-07-28 <-- won't get merged b/c not within date range
1   012345  2005-10-12
2   123456  2005-05-15
3   123456  2005-05-17
4   123456  2005-05-25
5   123456  2005-05-30
6   123456  2005-08-08
7   123456  2005-11-29
8   abcxyz  2005-12-31 <-- won't be merged because company not in df1

Desired merged result:

    company DATADATE    EventDate
0   012345  2005-06-30  2005-10-12
1   012345  2005-09-30  NaN   <-- nan because no EventDates fell in this range
2   012345  2005-12-31  NaN
3   012345  2006-03-31  NaN
4   123456  2005-01-31  2005-05-15
5   123456  2005-01-31  2005-05-17
5   123456  2005-01-31  2005-05-25
5   123456  2005-01-31  2005-05-30
6   123456  2005-03-31  2005-08-08
7   123456  2005-06-30  2005-11-19
8   123456  2005-09-30  NaN