How to left join 2 dataframes in Python, using the first row when the filtered second dataframe has multiple matching rows

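I have 2 dataframes, each with a column of dtype datetime. I want to join the second dataframe with the first one under the following conditions:

  • Find the rows of the first dataframe whose datetime value lies between the second dataframe's datetime value and 10 minutes before it

  • If there is more than one such row, take the first one

  • If there is no such row, fill with empty or null

  • A row can be joined only once

Right now I do it as shown below. I would like to know whether there is a better way that reduces the total running time.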

    from datetime import datetime
    import datetime as dt
    import pandas as pd
    
    
    df1 = pd.DataFrame(columns = ['Enter_Time', 'Unique_Id'])
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:29:00','%Y-%m-%d %H:%M:%S'), 'A']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:30:00','%Y-%m-%d %H:%M:%S'), 'B']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:31:00','%Y-%m-%d %H:%M:%S'), 'C']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:32:00','%Y-%m-%d %H:%M:%S'), 'D']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:33:00','%Y-%m-%d %H:%M:%S'), 'E']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:29:00','%Y-%m-%d %H:%M:%S'), 'F']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:30:00','%Y-%m-%d %H:%M:%S'), 'G']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:31:00','%Y-%m-%d %H:%M:%S'), 'H']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:32:00','%Y-%m-%d %H:%M:%S'), 'I']
    df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:33:00','%Y-%m-%d %H:%M:%S'), 'j']
    
    
    df2 = pd.DataFrame(columns = ['Transaction_Time', 'Amount'])
    df2.loc[len(df2)] = [datetime.strptime('2018-10-01 06:40:00','%Y-%m-%d %H:%M:%S'), 10.25]
    df2.loc[len(df2)] = [datetime.strptime('2018-10-01 07:40:00','%Y-%m-%d %H:%M:%S'), 3.96]
    df2.loc[len(df2)] = [datetime.strptime('2018-10-01 08:31:00','%Y-%m-%d %H:%M:%S'), 9.65]
    df2.loc[len(df2)] = [datetime.strptime('2018-10-01 08:32:00','%Y-%m-%d %H:%M:%S'), 2.84]
    
    df3 = pd.DataFrame(columns = ['Transaction_Time', 'Amount', 'Enter_Time', 'Unique_Id'])
    
    for id, row in df2.iterrows():
        Transaction_Time = row['Transaction_Time']
        Transaction_Time_Before = Transaction_Time - dt.timedelta(seconds = 600)
        Result_Row = {
            'Transaction_Time' : row['Transaction_Time'],
            'Amount' : row['Amount'],
            'Enter_Time' : '',
            'Unique_Id' : ''
        }
    
        dfFiletered = df1[(df1["Enter_Time"] < Transaction_Time) & (df1["Enter_Time"] >= Transaction_Time_Before)].sort_values(by= ['Enter_Time'],ascending=True)
        if len(dfFiletered) > 0:
            firstRow = dfFiletered.iloc[0]
            Result_Row['Enter_Time'] = firstRow['Enter_Time']
            Result_Row['Unique_Id'] = firstRow['Unique_Id']
            df1.drop(df1[df1["Unique_Id"] == firstRow['Unique_Id']].index, inplace=True)
        df3.loc[len(df3)] = Result_Row
    print(df3)
    
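You can use pd.merge_asof with df1 on the left, direction='forward' and a tolerance of 10 minutes. It will produce: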

    #           Enter_Time Unique_Id    Transaction_Time  Amount
    #0 2018-10-01 06:29:00         A                 NaT     NaN
    #1 2018-10-01 06:30:00         B 2018-10-01 06:40:00   10.25
    #2 2018-10-01 06:31:00         C 2018-10-01 06:40:00   10.25
    #3 2018-10-01 06:32:00         D 2018-10-01 06:40:00   10.25
    #4 2018-10-01 06:33:00         E 2018-10-01 06:40:00   10.25
    #5 2018-10-01 08:29:00         F 2018-10-01 08:31:00    9.65
    #6 2018-10-01 08:30:00         G 2018-10-01 08:31:00    9.65
    #7 2018-10-01 08:31:00         H 2018-10-01 08:31:00    9.65
    #8 2018-10-01 08:32:00         I 2018-10-01 08:32:00    2.84
    #9 2018-10-01 08:33:00         j                 NaT     NaN
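
The call that produces this table is not shown above. A minimal sketch, assuming it is the pd.merge_asof call used later in this answer without the duplicated step. The frames are rebuilt here with pd.to_datetime so that the key columns are guaranteed to have a datetime dtype; that construction is my choice, not part of the original answer:

```python
import pandas as pd

# Sample data from the question, built in one go with real datetime columns.
df1 = pd.DataFrame({
    "Enter_Time": pd.to_datetime([
        "2018-10-01 06:29:00", "2018-10-01 06:30:00", "2018-10-01 06:31:00",
        "2018-10-01 06:32:00", "2018-10-01 06:33:00", "2018-10-01 08:29:00",
        "2018-10-01 08:30:00", "2018-10-01 08:31:00", "2018-10-01 08:32:00",
        "2018-10-01 08:33:00",
    ]),
    "Unique_Id": list("ABCDEFGHIj"),
})
df2 = pd.DataFrame({
    "Transaction_Time": pd.to_datetime([
        "2018-10-01 06:40:00", "2018-10-01 07:40:00",
        "2018-10-01 08:31:00", "2018-10-01 08:32:00",
    ]),
    "Amount": [10.25, 3.96, 9.65, 2.84],
})

# For every Enter_Time, attach the nearest Transaction_Time at or after it,
# provided it is no more than 10 minutes away.
# merge_asof requires both frames to be sorted by their key column.
merged = pd.merge_asof(
    df1, df2,
    left_on="Enter_Time", right_on="Transaction_Time",
    tolerance=pd.Timedelta("10min"),
    direction="forward",
)
print(merged)
```

Note that several entries share the same transaction at this stage (B through E all receive 06:40), which is why a deduplication step is still needed.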
    
To keep only the first match for each transaction, use:

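    import numpy as np
    import pandas as pd

    # df1 and df2 are the frames from the question, before its loop drops rows from df1
    df = pd.merge_asof(df1,
                       df2,
                       left_on='Enter_Time',
                       right_on='Transaction_Time',
                       tolerance=pd.Timedelta('10m'),
                       direction='forward')

    # Blank out repeated matches so each transaction stays only on the first row that matched it
    df.loc[df.duplicated(['Transaction_Time', 'Amount']), ['Transaction_Time', 'Amount']] = (np.nan, np.nan)
    df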
    #           Enter_Time Unique_Id    Transaction_Time  Amount
    #0 2018-10-01 06:29:00         A                 NaT     NaN
    #1 2018-10-01 06:30:00         B 2018-10-01 06:40:00   10.25
    #2 2018-10-01 06:31:00         C                 NaT     NaN
    #3 2018-10-01 06:32:00         D                 NaT     NaN
    #4 2018-10-01 06:33:00         E                 NaT     NaN
    #5 2018-10-01 08:29:00         F 2018-10-01 08:31:00    9.65
    #6 2018-10-01 08:30:00         G                 NaT     NaN
    #7 2018-10-01 08:31:00         H                 NaT     NaN
    #8 2018-10-01 08:32:00         I 2018-10-01 08:32:00    2.84
    #9 2018-10-01 08:33:00         j                 NaT     NaN
    
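Edit

To merge df1 into df2 (df2 on the left), I think you need to keep the default direction ('backward'); this is the pd.merge_asof(df2, df1, ...) listing further down.

The duplicated step does not change anything for your example, but it is there to solve the problem in the general case.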

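Comments:

I tried it with df4 = pd.merge_asof(df2, df1, left_on='Transaction_Time', right_on='Enter_Time', tolerance=pd.Timedelta('10m'), direction='forward'), but the first row does not get any match. How do I get the correct join for the first row?

Sorry if my question caused confusion. df2 is the left dataframe and I want to join df1 to it. So your example becomes df = pd.merge_asof(df2, df1, left_on='Transaction_Time', right_on='Enter_Time', tolerance=pd.Timedelta('10m'), direction='forward'). It works, but the second row of df1 is joined instead of the first, and the first row has no join at all. I tried '11m' as the delta value, but got the same result.

@Xpeditions I think the direction should be the default, as in the edit.

Transaction time 6:40 is merged with enter time 6:33. But it should be merged with enter time 6:30, the first one in the 10-minute interval.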
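    import numpy as np

    # df2 on the left; the default direction ('backward') picks the latest Enter_Time at or before each transaction
    df = pd.merge_asof(df2,
                       df1,
                       left_on='Transaction_Time',
                       right_on='Enter_Time',
                       tolerance=pd.Timedelta('10m'))

    # With df2 on the left, the columns to deduplicate are the ones coming from df1,
    # so that a df1 row already joined to an earlier transaction is blanked out
    df.loc[df.duplicated(['Enter_Time', 'Unique_Id']), ['Enter_Time', 'Unique_Id']] = (np.nan, np.nan)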
    #     Transaction_Time  Amount          Enter_Time Unique_Id
    #0 2018-10-01 06:40:00   10.25 2018-10-01 06:33:00         E
    #1 2018-10-01 07:40:00    3.96                 NaT       NaN
    #2 2018-10-01 08:31:00    9.65 2018-10-01 08:31:00         H
    #3 2018-10-01 08:32:00    2.84 2018-10-01 08:32:00         I
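
As the last comment notes, a backward merge_asof attaches the latest earlier entry (06:33) rather than the first one inside the 10-minute window (06:30), and it also accepts an entry at exactly the transaction time. A minimal sketch of one way to pick the first unused entry instead, with a single forward pass over both sorted frames. The function name join_first_in_window and its parameters are hypothetical, and the sketch assumes that transactions are processed in time order and that the window is [Transaction_Time - 10 min, Transaction_Time), as in the question's filter:

```python
import pandas as pd


def join_first_in_window(transactions, entries, window=pd.Timedelta("10min")):
    """Left-join each transaction to the earliest unused entry in [t - window, t).

    Hypothetical helper, not part of pandas. Column names follow the question.
    """
    left = transactions.sort_values("Transaction_Time", kind="stable").reset_index(drop=True)
    right = entries.sort_values("Enter_Time", kind="stable").reset_index(drop=True)

    times = list(left["Transaction_Time"])   # pd.Timestamp objects
    enter = list(right["Enter_Time"])

    matched = []   # position in `right` for each transaction, or -1 if none
    i = 0          # first entry that is neither used nor too old
    for t in times:
        lower = t - window
        # Entries older than the window can never match this or any later transaction.
        while i < len(enter) and enter[i] < lower:
            i += 1
        if i < len(enter) and enter[i] < t:
            matched.append(i)
            i += 1   # consume the entry so it is joined only once
        else:
            matched.append(-1)

    # Label -1 does not exist in `right`, so reindex yields an all-missing row for it.
    picked = right.reindex(matched).reset_index(drop=True)
    return pd.concat([left, picked], axis=1)


# Sample data from the question.
df1 = pd.DataFrame({
    "Enter_Time": pd.to_datetime([
        "2018-10-01 06:29:00", "2018-10-01 06:30:00", "2018-10-01 06:31:00",
        "2018-10-01 06:32:00", "2018-10-01 06:33:00", "2018-10-01 08:29:00",
        "2018-10-01 08:30:00", "2018-10-01 08:31:00", "2018-10-01 08:32:00",
        "2018-10-01 08:33:00",
    ]),
    "Unique_Id": list("ABCDEFGHIj"),
})
df2 = pd.DataFrame({
    "Transaction_Time": pd.to_datetime([
        "2018-10-01 06:40:00", "2018-10-01 07:40:00",
        "2018-10-01 08:31:00", "2018-10-01 08:32:00",
    ]),
    "Amount": [10.25, 3.96, 9.65, 2.84],
})

result = join_first_in_window(df2, df1)
print(result)
# Unique_Id per transaction: B, NaN, F, G
```

Each frame is sorted once and the pointer into the entries only moves forward, so the matching loop is linear in the number of rows, whereas the loop in the question filters df1 again for every transaction.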