Python pandas: merging rows based on a column meeting a condition

python pandas merge

I'm new to pandas and don't know the best way to approach this.

I have loaded two files into two separate DataFrames:

>>> frame1.head()
Out[64]:

             Date and Time  Sample  Unnamed: 2
0  05/18/2017 08:38:37:490   163.7         NaN
1  05/18/2017 08:39:37:490   164.5         NaN
2  05/18/2017 08:40:37:490   148.7         NaN
3  05/18/2017 08:41:37:490   111.2         NaN
4  05/18/2017 08:42:37:490    83.6         NaN


>>> frame2.head()
Out[66]:

             Date and Time  Sample  Unnamed: 2
0  05/18/2017 08:38:38:490     7.5         NaN
1  05/18/2017 08:39:38:490     7.5         NaN
2  05/18/2017 08:40:38:490     7.5         NaN
3  05/18/2017 08:41:38:490     7.5         NaN
4  05/18/2017 08:42:38:490     7.5         NaN

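I need to "merge" any row in frame1 with any row in frame2 whenever the two rows are no more than one second apart.

For example, this row from frame1:

0   05/18/2017 08:38:37:490 163.7   NaN

is within 1 second of this row from frame2:

0   05/18/2017 08:38:38:490 7.5 NaN

So when they are "merged", the output should look like this: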

0   05/18/2017 08:38:37:490 163.7 7.5 NaN NaN

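In other words, the time from one row replaces the time of the other, and all the remaining columns are appended.

The closest I have come is: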

    d3 = pd.merge(frame1, frame2, on='Date and Time (MM/DD/YYYY HH:MM:SS:sss)', how='outer')

>>> d3.head()
             Date and Time  Sample_x  Unnamed: 2_x  Sample_y  Unnamed: 2_y
0  05/18/2017 08:38:37:490     163.7           NaN       NaN           NaN
1  05/18/2017 08:39:37:490     164.5           NaN       NaN           NaN
2  05/18/2017 08:40:37:490     148.7           NaN       NaN           NaN
3  05/18/2017 08:41:37:490     111.2           NaN       NaN           NaN
4  05/18/2017 08:42:37:490      83.6           NaN       NaN           NaN

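However, this is not a conditional merge. I need rows to merge when they are within one second of each other, not only when the timestamps are exactly equal.

I know I can compare times with something like this: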

import datetime
def compare_time(current, temp, sec=1):
    # True when the two timestamps are within `sec` seconds of each other
    return abs(current - temp) <= datetime.timedelta(seconds=sec)

Edit: merging in the reverse order (df2, df1) and then appending the renamed df1 rows, as described in the comments, gives:

                 datetime  sample_x  sample_y
0 2017-01-01 00:00:00.000         0     100.0
1 2017-01-01 00:00:00.300         1     100.0
2 2017-01-01 00:00:00.600         2     100.0
3 2017-01-01 00:00:00.900         3     100.0
0 2017-01-01 00:00:00.000       100       NaN
1 2017-01-01 00:00:01.000       101       NaN
2 2017-01-01 00:00:02.000       102       NaN
3 2017-01-01 00:00:03.000       103       NaN

Note that it keeps the original row indexes (0 is listed twice).

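You can use merge_asof as @Wen suggested, but be sure to specify the optional tolerance value. Also consider setting the direction option to suit your matching; it can be 'backward' (the default), 'nearest', or 'forward'.
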
pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )
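
To apply this to the data in the question, the timestamp strings first have to become real datetimes, since merge_asof needs a sorted, ordered key. The sketch below is only an illustration under assumptions: it retypes the first three rows of each frame by hand, uses the short column name 'Date and Time', and guesses the format string from the values shown. It then runs the merge with the default direction and with direction='nearest' so the two can be compared.

```python
import pandas as pd

# First rows of the question's two files, retyped by hand (illustrative only).
frame1 = pd.DataFrame({
    "Date and Time": ["05/18/2017 08:38:37:490", "05/18/2017 08:39:37:490",
                      "05/18/2017 08:40:37:490"],
    "Sample": [163.7, 164.5, 148.7],
})
frame2 = pd.DataFrame({
    "Date and Time": ["05/18/2017 08:38:38:490", "05/18/2017 08:39:38:490",
                      "05/18/2017 08:40:38:490"],
    "Sample": [7.5, 7.5, 7.5],
})

# The milliseconds follow a colon, so the format is spelled out explicitly;
# %f accepts between one and six digits.
fmt = "%m/%d/%Y %H:%M:%S:%f"
for frame in (frame1, frame2):
    frame["Date and Time"] = pd.to_datetime(frame["Date and Time"], format=fmt)

one_second = pd.Timedelta("1s")

# Default direction='backward': only frame2 rows at or before each frame1 row
# are candidates.
backward = pd.merge_asof(frame1, frame2, on="Date and Time",
                         tolerance=one_second)

# direction='nearest': frame2 rows on either side of each frame1 row are candidates.
nearest = pd.merge_asof(frame1, frame2, on="Date and Time",
                        tolerance=one_second, direction="nearest")

print(backward)
print(nearest)
```

In this data every frame2 row comes one second after its frame1 partner, so the backward search has no candidate within the tolerance and Sample_y stays NaN throughout. 'nearest' or 'forward' at least looks in the right direction. The partners sit exactly at the one-second limit, so it is worth checking the result and widening the tolerance slightly if they do not match.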

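Here is a longer explanation using sample data (note that I am just creating new sample data, since I can only see the first few rows of the actual data):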
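
The code that built the sample data is not included here. The following is a reconstruction, not the original: the values for df1 and df2 are an assumption inferred from the outputs shown in this answer (four rows one second apart with samples 100 to 103, and four rows 300 ms apart with samples 0 to 3).

```python
import pandas as pd

# Reconstructed sample data (assumed values, inferred from the printed outputs).
df1 = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 00:00:01",
        "2017-01-01 00:00:02", "2017-01-01 00:00:03",
    ]),
    "sample": [100, 101, 102, 103],
})
df2 = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-01-01 00:00:00.000", "2017-01-01 00:00:00.300",
        "2017-01-01 00:00:00.600", "2017-01-01 00:00:00.900",
    ]),
    "sample": [0, 1, 2, 3],
})

# Both frames have a 'sample' column, so the merge suffixes them _x and _y.
merged = pd.merge_asof(df1, df2, on="datetime", tolerance=pd.Timedelta("1s"))
print(merged)
```
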
Note that merge_asof performs a left join, so you can get a different answer by swapping the order of df1 and df2:
pd.merge_asof( df2, df1, on='datetime', tolerance=pd.Timedelta('1s') )
Out[218]: 
                 datetime  sample_x  sample_y
0 2017-01-01 00:00:00.000         0       100
1 2017-01-01 00:00:00.300         1       100
2 2017-01-01 00:00:00.600         2       100
3 2017-01-01 00:00:00.900         3       100

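Edited to add: the documentation says merge_asof does a left join by design, but it seems to differ from a true left join in that it excludes unmatched rows from the left DataFrame. To work around this, you can do the following: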

pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )  \
    .append(df1.rename(columns={'sample':'sample_x'})).drop_duplicates('sample_x')
Out[236]: 
             datetime  sample_x  sample_y
0 2017-01-01 00:00:00       100       0.0
1 2017-01-01 00:00:01       101       3.0
2 2017-01-01 00:00:02       102       NaN
3 2017-01-01 00:00:03       103       NaN

Note that you may need to adjust the drop_duplicates call depending on whether you have a unique index and/or unique columns.
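
On current pandas the snippet above will not run as written, because DataFrame.append was removed in pandas 2.0. A rough equivalent using pd.concat is sketched below; the sample frames are assumed values chosen to match the outputs in this answer, and `leftovers` and `result` are names made up for the illustration. Passing ignore_index=True also renumbers the rows, which avoids repeated index labels.

```python
import pandas as pd

# Assumed sample frames, chosen to match the outputs shown in this answer.
df1 = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 00:00:01",
        "2017-01-01 00:00:02", "2017-01-01 00:00:03",
    ]),
    "sample": [100, 101, 102, 103],
})
df2 = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-01-01 00:00:00.000", "2017-01-01 00:00:00.300",
        "2017-01-01 00:00:00.600", "2017-01-01 00:00:00.900",
    ]),
    "sample": [0, 1, 2, 3],
})

merged = pd.merge_asof(df1, df2, on="datetime", tolerance=pd.Timedelta("1s"))

# pd.concat stands in for the removed DataFrame.append; ignore_index=True
# gives the combined frame a fresh 0..n-1 index before duplicates are dropped.
leftovers = df1.rename(columns={"sample": "sample_x"})
result = (
    pd.concat([merged, leftovers], ignore_index=True)
      .drop_duplicates("sample_x")
)
print(result)
```

As far as I can tell, recent pandas versions already keep unmatched left rows in the merge_asof result (with NaN in the right-hand columns), so with this data the concat and drop_duplicates step may change nothing. It matters more when the rows being added back come from the right-hand frame.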
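
You can check pd.merge_asof.

How large is each DataFrame, in rows?

If r1 has timestamp 05/18/2017 08:38:37:490, r2 has timestamp 05/18/2017 08:39:36:490, and r3 has timestamp 05/18/2017 08:40:35:490, how should they be merged? r1 and r2 are within one second, and so are r2 and r3, but r1 and r3 are not.

How can I keep both rows df2.2 and df2.3? The time ending in .6 is missing from the resulting DataFrame.

How do I keep the rows that were not merged? Note that the sample values 101, 102, 103 are missing from the merged result.

blah = pd.merge_asof(df2, df1, on='datetime', tolerance=pd.Timedelta('1s')).append(df1.rename(columns={'sample': 'sample_x'})).drop_duplicates('sample_x') (with d2 and d1 in the opposite order) looks like it does what I want, but it has the odd effect of producing multiple rows with index zero. (I will post it in an edit.)

@JillRussek Sorry, I don't completely follow. I see that you changed the order of df2 and df1 in the merge, so you may need to switch sample_x to sample_y, but I'm not entirely sure what final output you are trying to get.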