Python pandas: joining DataFrames and merging the values of identically named columns

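I have nine different DataFrames that I want to join (or merge, or update) into one single DataFrame. Each original DataFrame consists of only two columns: a time column in seconds and the value observed at that time. The data looks like this: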

   Filter_type         Time
0          0.0  6333.137168


   Filter_type         Time
0          0.0  6347.422576


   Filter_type         Time
0          0.0  7002.406185


   Filter_type         Time
0          0.0  7015.845717


   Sign_pos_X         Time
0        11.5  6333.137168
1        25.0  6347.422576
2        25.5  7002.406185
3        38.0  7015.845717


   Sign_pos_Y         Time
0        -3.0  6333.137168
1         8.0  6347.422576
2        -7.5  7002.406185
3        -0.5  7015.845717


   Sign_pos_Z         Time
0         1.0  6333.137168
1         1.0  6347.422576
2         1.0  7002.406185
3         7.5  7015.845717


   Supplementary_sign_type         Time
0                      0.0  6333.137168
1                      0.0  6347.422576
2                      0.0  7002.406185
3                      0.0  7015.845717


          Time  vision_only_sign_type
0  6333.137168                    7.0
1  6347.422576                    9.0
2  7002.406185                    9.0
3  7015.845717                   35.0

Since I want to combine all of the DataFrames into a single one, I tried the following:

df2 = None

for cell in df['Frames']:
    if not isinstance(cell, list):
        continue

    df_ = pd.DataFrame(cell)
    if df2 is None:
        # first iteration
        df2 = df_
        continue

    df2 = df2.merge(df_, on='Offset', how='outer') 
    #df2 = df2.join(df_)
    #df2.update(df_, join='outer')

df2

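The problem is that the first four DataFrames use the same name for their value column, while the others do not. As a result, the output has several columns prefixed with "Filter_type":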

+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+
|    |   Filter_type_x |   Offset |   Filter_type_y |   Filter_type_x |   Filter_type_y |   Sign_pos_X |   Sign_pos_Y |   Sign_pos_Z |   Supplementary_sign_type |   vision_only_sign_type |
|----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------|
|  0 |               0 |  6333.14 |             nan |             nan |             nan |         11.5 |         -3   |          1   |                         0 |                       7 |
|  1 |             nan |  6347.42 |               0 |             nan |             nan |         25   |          8   |          1   |                         0 |                       9 |
|  2 |             nan |  7002.41 |             nan |               0 |             nan |         25.5 |         -7.5 |          1   |                         0 |                       9 |
|  3 |             nan |  7015.85 |             nan |             nan |               0 |         38   |         -0.5 |          7.5 |                         0 |                      35 |
+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+

My question is: how can I force the merge/join to combine all of the "Filter_type" columns into one column? As you can see, in each row only one of these columns holds a value and the others are NaN. The result should look like this (a single merged "Filter_type" column):

+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+
|    |   Offset |   Sign_pos_X |   Sign_pos_Y |   Sign_pos_Z |   Supplementary_sign_type |   vision_only_sign_type |   Filter_type |
|----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------|
|  0 |  6333.14 |         11.5 |         -3   |          1   |                         0 |                       7 |             0 |
|  1 |  6347.42 |         25   |          8   |          1   |                         0 |                       9 |             0 |
|  2 |  7002.41 |         25.5 |         -7.5 |          1   |                         0 |                       9 |             0 |
|  3 |  7015.85 |         38   |         -0.5 |          7.5 |                         0 |                      35 |             0 |
+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+

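Calling pd.merge in a loop hurts performance when the DataFrames are long or when there are many of them, because every iteration builds a new intermediate DataFrame. So avoid it if possible.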

Here it looks like we want to concatenate the DataFrames vertically when they have both a Time and a Filter_type column, and horizontally when they lack the Filter_type column:

frames = [df.set_index('Time') for df in frames]
filter_type_frames = pd.concat(frames[:4], axis=0)
result = pd.concat([filter_type_frames] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)

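pd.concat concatenates vertically when called with axis=0 and horizontally when called with axis=1. Because pd.concat accepts a list of DataFrames and can join them all in one call, without creating intermediate DataFrames one iteration at a time, it avoids the quadratic copying that repeated merging causes.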
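
To make the two directions concrete, here is a minimal sketch. The frames a, b and c are hypothetical toy data, not the frames from the question.

```python
import pandas as pd

# Hypothetical toy frames, not the data from the question.
a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3, 4]}, index=[2, 3])
c = pd.DataFrame({'y': [10, 20]}, index=[0, 1])

# axis=0 stacks the rows: same column, more rows.
stacked = pd.concat([a, b], axis=0)

# axis=1 places the frames side by side: same rows, more columns.
side_by_side = pd.concat([a, c], axis=1)

print(stacked.shape)       # (4, 1)
print(side_by_side.shape)  # (2, 2)
```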

Because pd.concat aligns on the index, setting the index to Time makes the data line up correctly by Time.
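
The effect of the index on alignment can be seen in a small sketch. The frames left and right are hypothetical, and the second one deliberately lists its times in a different order.

```python
import pandas as pd

# Hypothetical toy data: the same three times, listed in a different order.
left = pd.DataFrame({'Time': [1.0, 2.0, 3.0], 'a': [10, 20, 30]})
right = pd.DataFrame({'Time': [3.0, 1.0, 2.0], 'b': [300, 100, 200]})

# With the default integer index, rows are paired by position,
# so a=10 (Time 1.0) ends up next to b=300 (Time 3.0).
by_position = pd.concat([left, right], axis=1)

# With Time as the index, rows are paired by their Time label instead.
by_time = pd.concat([left.set_index('Time'), right.set_index('Time')], axis=1)

print(by_position)
print(by_time)
```

In by_time the row labelled 1.0 holds a=10 and b=100, regardless of the order in which the rows were stored.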

See below for a runnable example.

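There is another way to solve the problem which is, in some ways, prettier. However, it calls pd.merge in a loop, so it may perform poorly for the reason explained above.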

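The idea, however, is this: by default, pd.merge(left, right) merges on all column labels that left and right share. So if you omit on='Offset' (or on='Time'?) and pass only how='outer', the merge will join on Offset (or Time) and Filter_type whenever both are present.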
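
A minimal sketch of that default behaviour, using hypothetical toy frames (f1, f2, sx) modelled on the layout in the question:

```python
import pandas as pd

# Hypothetical toy frames modelled on the question's layout.
f1 = pd.DataFrame({'Filter_type': [0.0], 'Time': [1.0]})
f2 = pd.DataFrame({'Filter_type': [0.0], 'Time': [2.0]})
sx = pd.DataFrame({'Sign_pos_X': [11.5, 25.0], 'Time': [1.0, 2.0]})

# on='Time': Filter_type is not a key, so it is kept twice with suffixes.
explicit = pd.merge(f1, f2, on='Time', how='outer')
print(list(explicit.columns))  # ['Filter_type_x', 'Time', 'Filter_type_y']

# No `on`: the keys are the shared columns, here Filter_type and Time.
default = pd.merge(f1, f2, how='outer')
print(list(default.columns))   # ['Filter_type', 'Time']

# Only Time is shared with sx, so Time alone becomes the key.
combined = pd.merge(default, sx, how='outer')
print(list(combined.columns))  # ['Filter_type', 'Time', 'Sign_pos_X']
```

The suffixed columns in explicit are the same effect that produced Filter_type_x and Filter_type_y in the question.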

You can shorten the loop by using

import functools
df2 = functools.reduce(functools.partial(pd.merge, how='outer'), df['Frames'])

The loop is hidden inside functools.reduce, but in essence pd.merge is still being called in a loop. So while this is pretty, it may not be performant.
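
To show that the loop is only hidden, the following sketch compares functools.reduce with an explicit loop. The frames and the helper name merge_outer are hypothetical and chosen for illustration only.

```python
import functools
import pandas as pd

# Hypothetical toy frames that all share the Time column.
frames = [pd.DataFrame({'Time': [1.0, 2.0], 'a': [1, 2]}),
          pd.DataFrame({'Time': [1.0, 2.0], 'b': [3, 4]}),
          pd.DataFrame({'Time': [1.0, 2.0], 'c': [5, 6]})]

merge_outer = functools.partial(pd.merge, how='outer')

# reduce computes merge_outer(merge_outer(frames[0], frames[1]), frames[2]).
reduced = functools.reduce(merge_outer, frames)

# The same computation as an explicit loop: one intermediate frame per step.
looped = frames[0]
for frame in frames[1:]:
    looped = merge_outer(looped, frame)

print(reduced.equals(looped))  # True
```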

import functools
import pandas as pd

# The nine frames from the question: four single-row Filter_type frames,
# followed by five four-row frames with one value column each.
frames = [pd.DataFrame({'Filter_type': [0.0], 'Time': [6333.137168]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [6347.422576]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [7002.406185]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [7015.845717]}),
          pd.DataFrame({'Sign_pos_X': [11.5, 25.0, 25.5, 38.0],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Sign_pos_Y': [-3.0, 8.0, -7.5, -0.5],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Sign_pos_Z': [1.0, 1.0, 1.0, 7.5],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Supplementary_sign_type': [0.0, 0.0, 0.0, 0.0],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717],
                        'vision_only_sign_type': [7.0, 9.0, 9.0, 35.0]})]

# Merge approach: outer-merge the frames pairwise on whatever columns they share.
result = functools.reduce(functools.partial(pd.merge, how='outer'), frames)
print(result)

# Concat approach: stack the Filter_type frames, then join everything side by side on Time.
frames = [df.set_index('Time') for df in frames]
A = pd.concat(frames[:4], axis=0)
result = pd.concat([A] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)
# same values; Time is the first column here because reset_index inserts it at the front

The first print(result) outputs:

   Filter_type         Time  Sign_pos_X  Sign_pos_Y  Sign_pos_Z  \
0          0.0  6333.137168        11.5        -3.0         1.0   
1          0.0  6347.422576        25.0         8.0         1.0   
2          0.0  7002.406185        25.5        -7.5         1.0   
3          0.0  7015.845717        38.0        -0.5         7.5   

   Supplementary_sign_type  vision_only_sign_type  
0                      0.0                    7.0  
1                      0.0                    9.0  
2                      0.0                    9.0  
3                      0.0                   35.0

Very nice solution. In the meantime I had also come up with the idea of concatenating the first frames. But I really like your function call as well. I will look into that too!