Python: populate one DataFrame column from another DataFrame based on partial string matching


python string pandas dataframe


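I'm new to Python programming. I have two DataFrames: df1 contains tags (180k rows) and df2 contains equipment names (1,600 rows).

df1:

df2:

Each df2.Equipment value appears somewhere inside the string in df1.TagName. I need to match rows by checking whether a df2 Equipment value occurs in a df1 TagName, and then attach the df2 columns (EquipmentDescription and EquipmentNo) to the matching df1 rows.

The final output should be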

          Line                          TagName EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL        Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK           Roller bed      1311259
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy        Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV        Lifting table      1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy           Roller bed      1311260

This is what I have tried so far:

cols= df2['Equipment'].tolist()
Xs=[]
for i in cols:
    Test = df1.loc[df1.TagName.str.contains(i)] 
    Test['Equip']=i
    Xs.append(Test)

Then I merge Xs with df2 on "Equipment".

But I get this error:

first argument must be string or compiled pattern
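
One possible cause of this error, offered as an assumption since the real df2 is not shown: str.contains hands each search term to Python's re module, which raises this message when the term is not a string (for example a NaN left by an empty cell). The sketch below uses small hypothetical data with a NaN added on purpose, drops non-string terms, and matches them as literal text. The names `names`, `parts`, and `matched` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN in 'Equipment' is an assumption used to
# mimic a value that is not a string.
df1 = pd.DataFrame({'Line': ['PT_WOA', 'PT_WOA'],
                    'TagName': ['.ZS01_LA120_T05.SB.S2384_LesSwL',
                                '.ZS01_RB2202_T05.SB.S2385_FLOK']},
                   index=[187877, 187878])
df2 = pd.DataFrame({'EquipmentNo': [1311256, 1311259, 1311261],
                    'EquipmentDescription': ['Lifting table', 'Roller bed', 'Unknown'],
                    'Equipment': ['LA120', 'RB2202', np.nan]})

# Keep only real strings as search terms.
names = df2['Equipment'].dropna().astype(str).unique()

parts = []
for name in names:
    # regex=False treats the name as literal text; na=False skips missing tags.
    hit = df1.loc[df1['TagName'].str.contains(name, regex=False, na=False)]
    # assign() returns a new frame, so nothing is written to a slice of df1.
    parts.append(hit.assign(Equipment=name))

# The helper column is named 'Equipment' so that it lines up with the merge key.
matched = pd.concat(parts).merge(df2, on='Equipment', how='left')
```

Note that the merge discards the original df1 index; call reset_index() on the concatenated frame first if the row labels are needed.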

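I would do it like this:

Create a new column, index, in df2: for each Equipment, look up the list of df1 index labels whose TagName contains that Equipment.

Flatten index with stack() and reset_index(), so that each item gets its own row.

Join the flattened df2 with df1 to get all the required information.

Output: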
          Line                          TagName EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL        Lifting table      1311256
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy        Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV        Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK           Roller bed      1311259
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy           Roller bed      1311260
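
The code for these steps did not survive in this copy of the answer, so the following is only a sketch of how they might look, not the answerer's original code. It uses the sample DataFrames from this thread; the intermediate names `wide`, `flat`, `result`, and the column `df1_index` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
     ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
     ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
     ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
     ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
    columns=['Line', 'TagName', 'CLASS'],
    index=[187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame(
    [[1311256, 'Lifting table', 'LA120'],
     [1311257, 'Roller bed', 'RB2200'],
     [1311258, 'Lifting table', 'LT2202'],
     [1311259, 'Roller bed', 'RB2202'],
     [1311260, 'Roller bed', 'RB2204']],
    columns=['EquipmentNo', 'EquipmentDescription', 'Equipment'])

# Step 1: for each Equipment, collect the df1 index labels whose TagName contains it.
df2['index'] = df2['Equipment'].apply(
    lambda name: df1.index[df1['TagName'].str.contains(name, regex=False)].tolist())

# Step 2: spread each list over columns, then stack() into one row per label.
wide = pd.DataFrame(df2['index'].tolist(), index=df2.index)
flat = (wide.stack()
            .dropna()                          # padding from shorter or empty lists
            .astype(int)
            .reset_index(level=1, drop=True)   # keep only the df2 row label
            .rename('df1_index'))

# Step 3: join the flattened labels back to df2, then pull in the df1 columns.
result = (df2.drop(columns=['index', 'Equipment'])
             .join(flat, how='inner')
             .set_index('df1_index')
             .join(df1[['Line', 'TagName']]))
result = result[['Line', 'TagName', 'EquipmentDescription', 'EquipmentNo']]
result.index.name = None
```

Equipment with no matching tag (RB2200 and LT2202 here) ends up with an empty list and drops out at the inner join. In newer pandas versions, Series.explode() can replace the stack() step.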


Initialize the provided DataFrames:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
                    ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
                    ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
                   columns = ['Line', 'TagName', 'CLASS'],
                   index = [187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
                    [1311257, 'Roller bed', 'RB2200'],
                    [1311258, 'Lifting table', 'LT2202'],
                    [1311259, 'Roller bed', 'RB2202'],
                    [1311260, 'Roller bed', 'RB2204']],
                  columns = ['EquipmentNo', 'EquipmentDescription', 'Equipment'])

I suggest the following:
# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])

# add the new columns: None gives an object column that can hold text,
# NaN gives a float column that is converted to integers at the end
df3['EquipmentDescription'] = None
df3['EquipmentNo'] = np.nan

# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
    for index_df2, row_df2 in df2.iterrows():

        # check if 'Equipment' is in 'TagName'
        if row_df2['Equipment'] in row_df3['TagName']:

            # set 'EquipmentDescription' and 'EquipmentNo'
            df3.loc[index_df3, 'EquipmentDescription'] = row_df2['EquipmentDescription']
            df3.loc[index_df3, 'EquipmentNo'] = row_df2['EquipmentNo']

# convert 'EquipmentNo' to a nullable integer type, so tags without a match stay <NA>
df3['EquipmentNo'] = df3['EquipmentNo'].astype('Int64')


This produces the following DataFrame:
        Line    TagName                         EquipmentDescription EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK  Roller bed           1311259
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy     Lifting table        1311256
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV  Lifting table        1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed           1311260
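
With 180k tags and 1,600 equipment names, the nested loop above performs roughly 288 million comparisons, which may be slow. As a different technique (not part of this answer), one could build a single regular expression from all equipment names, pull the matching name out of each tag with str.extract, and merge. This is a sketch under the assumption that each tag contains at most one equipment name; where several names match, it keeps the leftmost one in the tag, whereas the loop keeps the last matching row of df2.

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
     ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
     ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
     ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
     ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
    columns=['Line', 'TagName', 'CLASS'],
    index=[187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame(
    [[1311256, 'Lifting table', 'LA120'],
     [1311257, 'Roller bed', 'RB2200'],
     [1311258, 'Lifting table', 'LT2202'],
     [1311259, 'Roller bed', 'RB2202'],
     [1311260, 'Roller bed', 'RB2204']],
    columns=['EquipmentNo', 'EquipmentDescription', 'Equipment'])

# Longest names first, so a longer name is preferred over a shorter one it contains.
names = sorted(df2['Equipment'].dropna().astype(str).unique(), key=len, reverse=True)
# re.escape keeps characters such as '.' in a name from acting as regex syntax.
pattern = '(' + '|'.join(re.escape(name) for name in names) + ')'

df3 = df1.drop(columns=['CLASS'])
# extract() returns the first name found in each TagName, or NaN if none matches.
df3['Equipment'] = df3['TagName'].str.extract(pattern, expand=False)

# A left merge keeps tags without a match; reset_index/set_index preserve
# df1's row labels, which merge would otherwise discard.
df3 = (df3.reset_index()
          .merge(df2, on='Equipment', how='left')
          .set_index('index')
          .drop(columns='Equipment'))
df3.index.name = None
```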

Let me know if this helps.

Given df1 and df2 as follows:

df1
|    | Line   | TagName                         |   CLASS |
|---:|:-------|:--------------------------------|--------:|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL |      10 |
|  1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  |      10 |
|  2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     |      10 |
|  3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  |      10 |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy |      10 |

df2
|    |   EquipmentNo | EquipmentDescription   | Equipment   |
|---:|--------------:|:-----------------------|:------------|
|  0 |       1311256 | Lifting table          | LA120       |
|  1 |       1311257 | Roller bed             | RB2200      |
|  2 |       1311258 | Lifting table          | LT2202      |
|  3 |       1311259 | Roller bed             | RB2202      |
|  4 |       1311260 | Roller bed             | RB2204      |

Get the unique Equipment values from df2:
equipment = df2.Equipment.unique().tolist()

Create an Equipment column in df1 by searching each TagName for an entry of equipment:
df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))

Merge on Equipment to get the final form.

If you don't want the Equipment column in df_final, add .drop(columns=['Equipment']) to the end of the next line of code.

df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')

df_final
|    | Line   | TagName                         | Equipment   |   EquipmentNo | EquipmentDescription   |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120       |       1311256 | Lifting table          |
|  1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     | LA120       |       1311256 | Lifting table          |
|  2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  | LA120       |       1311256 | Lifting table          |
|  3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  | RB2202      |       1311259 | Roller bed             |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204      |       1311260 | Roller bed             |
