Python: join by date


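I'm trying to join two dataframes whose dates don't match exactly. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe whose date is the latest one on or before the left dataframe's date. It is probably easiest to show with an example.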

df1:

df2:

gives us:

group     date      teacher    hair length
  a     1/10/00        1           8
  a     2/27/00        1          20
  b     1/7/00         1           8
  b     4/5/00         1         100
  c     2/9/00         2           0
  c     9/12/00        2          50
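Edit 1:
Kludged together a way to do this. Basically, I iterate over every row in df1 and pick out the most recent corresponding entry in df2. It is far too slow; surely there must be a better way.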

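One way is to create a new column in the left dataframe that holds, for each row's date, the closest earlier (or equal) date found in the right dataframe: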

df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
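The one-liner above compares dates only and does not restrict the lookup to the row's own group. A minimal sketch of a group-aware variant follows. The df1 rows are taken from the result table in the question; the df2 rows are invented for illustration (the original df2 is not shown), and `latest_prior_date` is a hypothetical helper name.

```python
import pandas as pd

# df1 rows follow the result table in the question; df2 rows are made up.
df1 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-10", "2000-02-27", "2000-01-07",
                            "2000-04-05", "2000-02-09", "2000-09-12"]),
    "teacher": [1, 1, 1, 1, 2, 2],
})
df2 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-01-02",
                            "2000-04-01", "2000-01-01", "2000-09-01"]),
    "hair length": [8, 20, 8, 100, 0, 50],
})


def latest_prior_date(row):
    """Latest df2 date in the same group that is on or before row's date."""
    mask = (df2["group"] == row["group"]) & (df2["date"] <= row["date"])
    return df2.loc[mask, "date"].max()  # NaT when no earlier record exists


df1["join_date"] = df1.apply(latest_prior_date, axis=1)

# With join_date in place, an ordinary equality merge finishes the job.
joined = df1.merge(
    df2.rename(columns={"date": "join_date"}),
    on=["group", "join_date"],
    how="left",
)
print(joined)
```

With these sample rows the hair length column comes out as 8, 20, 8, 100, 0, 50, matching the table in the question. This still scans df2 once per df1 row, so it only helps correctness, not speed.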

The fastest way seems to be to use sqlite via pysqldf:

import pandas as pd
from pandasql import sqldf  # assumption: "pysqldf" refers to the pandasql package


def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys, base_id):
    # base_id was an undefined name in the original snippet; it is taken here
    # as a parameter naming the id column of tablea (an assumption).
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: "group" is a reserved word in sqlite and can't be used as a field
    # name, so the id column is referenced through {base_id} instead.
    # Inner query: for each tablea row, find the latest tableb date on or
    # before it. Outer query: fetch the tableb row carrying that date.
    statement = """SELECT a.{base_id}, a.{date_a} AS {temp_date}, b.*
                    FROM (SELECT tablea.{base_id}, tablea.{date_a}, tablea.{group_a},
                         MAX(tableb.{date_b}) AS tdate
                        FROM tablea
                        JOIN tableb
                        ON tablea.{group_a}=tableb.{group_b}
                        AND tablea.{date_a}>=tableb.{date_b}
                        GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                    JOIN tableb b
                    ON   a.{group_a}=b.{group_b}
                    AND  a.tdate=b.{date_b};
                    """.format(group_a=tablea_group, date_a=tablea_date,
                               group_b=tableb_group, date_b=tableb_date,
                               temp_date='join_date', base_id=base_id)
    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=[base_id] + list(tablea_keys),
                    right_on=[base_id, tableb_group, 'join_date'])
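To make the SQL idea easier to try in isolation, here is a minimal sketch of the same two-step query (aggregate the latest earlier date, then join back) using Python's built-in `sqlite3` module instead of pysqldf. The column names `grp`, `day` and `hair_length` and all sample rows are assumptions chosen for illustration; dates are stored as ISO strings so that string comparison orders them correctly.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; ISO date strings compare correctly as text.
left = pd.DataFrame({
    "grp": ["a", "a", "b", "b", "c", "c"],
    "day": ["2000-01-10", "2000-02-27", "2000-01-07",
            "2000-04-05", "2000-02-09", "2000-09-12"],
    "teacher": [1, 1, 1, 1, 2, 2],
})
right = pd.DataFrame({
    "grp": ["a", "a", "b", "b", "c", "c"],
    "day": ["2000-01-01", "2000-02-01", "2000-01-02",
            "2000-04-01", "2000-01-01", "2000-09-01"],
    "hair_length": [8, 20, 8, 100, 0, 50],
})

con = sqlite3.connect(":memory:")
left.to_sql("tablea", con, index=False)
right.to_sql("tableb", con, index=False)

query = """
SELECT a.grp, a.day, a.teacher, b.hair_length
FROM (
    -- latest tableb day on or before each tablea row, within the same group
    SELECT tablea.grp, tablea.day, tablea.teacher,
           MAX(tableb.day) AS tdate
    FROM tablea
    JOIN tableb
      ON tablea.grp = tableb.grp
     AND tablea.day >= tableb.day
    GROUP BY tablea.grp, tablea.day, tablea.teacher
) AS a
JOIN tableb AS b
  ON a.grp = b.grp
 AND a.tdate = b.day
ORDER BY a.grp, a.day
"""
result = pd.read_sql_query(query, con)
con.close()
print(result)
```

Because both joins are inner joins, a left-hand row with no earlier record in its group is dropped from the result rather than kept with a missing value.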
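If both dataframes are already sorted by date, a single forward pass over them also works:

# Assumes df1 and df2 are both sorted by date (as written, group is ignored)

df1['hair length'] = 0  # initialize

r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)

# Kept outside the loop so the latest value carries forward to later df1 rows.
# 0 is the fallback when a df1 date is earlier than every df2 date.
cur_hair_length = 0

for i, l_row in df1.iterrows():
    # Advance through df2 while its date is not later than the current df1 date
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break

    df1.loc[i, 'hair length'] = cur_hair_length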
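The sorted forward pass above is what pandas calls an as-of join. As a sketch of a vectorized alternative (not part of the original answers), `pd.merge_asof` performs the same backward-looking match and also handles the group column through its `by` parameter. The df1 rows follow the result table in the question; the df2 rows are invented for illustration.

```python
import pandas as pd

# df1 rows follow the result table in the question; df2 rows are made up.
df1 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-10", "2000-02-27", "2000-01-07",
                            "2000-04-05", "2000-02-09", "2000-09-12"]),
    "teacher": [1, 1, 1, 1, 2, 2],
})
df2 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-01-02",
                            "2000-04-01", "2000-01-01", "2000-09-01"]),
    "hair length": [8, 20, 8, 100, 0, 50],
})

# merge_asof requires both frames to be sorted by the "on" key.
joined = pd.merge_asof(
    df1.sort_values("date"),
    df2.sort_values("date"),
    on="date",             # match on the nearest date ...
    by="group",            # ... within the same group
    direction="backward",  # only look at df2 dates on or before df1's date
)

# Restore a group/date ordering for display.
joined = joined.sort_values(["group", "date"]).reset_index(drop=True)
print(joined)
```

With these sample rows the hair length column is 8, 20, 8, 100, 0, 50, matching the table in the question. Left-hand rows without any earlier record are kept, with a missing value in the right-hand columns.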