Python: populate a DataFrame column from another DataFrame based on a partial string match
I am new to Python programming. I have two DataFrames: df1 contains tags (180k rows) and df2 contains equipment names (1600 rows).
df1:
df2:
df2.Equipment appears somewhere inside the string in df1.TagName. I need to match rows based on whether a df2 Equipment value occurs in a df1 TagName, and then add the corresponding df2 columns (EquipmentDescription and EquipmentNo) to df1.
The final output should be:
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
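This is what I have tried so far:
cols = df2['Equipment'].tolist()
Xs = []
for i in cols:
    Test = df1.loc[df1.TagName.str.contains(i)]
    Test['Equip'] = i
    Xs.append(Test)
Then I merge Xs and df2 on the equipment name.
But I get this error:
first argument must be string or compiled pattern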
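A likely cause of this error (an assumption, since the full df2 is not shown) is that df2['Equipment'] contains a non-string value such as NaN. By default str.contains treats its argument as a regular expression, and re.compile rejects anything that is not a string. A minimal sketch with made-up sample data, which drops missing names and matches literally with regex=False:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN stands in for a blank Equipment cell.
df1 = pd.DataFrame({'TagName': ['.ZS01_LA120_T05.SB.S2384_LesSwL',
                                '.ZS01_RB2202_T05.SB.S2385_FLOK']})
df2 = pd.DataFrame({'Equipment': ['LA120', np.nan, 'RB2202']})

# Keep only real, non-empty strings so every search term is valid.
names = df2['Equipment'].dropna().astype(str)
names = names[names != ''].unique()

parts = []
for name in names:
    # regex=False does a plain substring test; na=False treats missing tags as "no match".
    hit = df1.loc[df1['TagName'].str.contains(name, regex=False, na=False)].copy()
    hit['Equip'] = name
    parts.append(hit)

matches = pd.concat(parts)
print(matches)
```

The .copy() also avoids assigning into a slice of df1, which the original loop does when it sets Test['Equip'].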
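I would do it like this:
Create an index column in df2 where, for each Equipment in df2, you find the list of indexes in df1 whose TagName contains that Equipment.
Use stack() and reset_index() to flatten the indexes, creating one row for each item.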
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
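The code for this answer is not preserved here, so the following is only a sketch of the steps as described, not the answerer's own code. It swaps in DataFrame.explode (pandas 0.25 or later) for the stack()/reset_index() flattening, and the column name df1_index and helper matching_labels are names chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Line': ['PT_WOA'] * 5,
     'TagName': ['.ZS01_LA120_T05.SB.S2384_LesSwL',
                 '.ZS01_RB2202_T05.SB.S2385_FLOK',
                 '.ZS01_LA120_T05.SB._CBAbsHy',
                 '.ZS01_LA120_T05.SB.S3110_CBAPV',
                 '.ZS01_LARB2204.SB.S3111_CBRelHy']},
    index=[187877, 187878, 187879, 187880, 187881])
df2 = pd.DataFrame(
    {'EquipmentNo': [1311256, 1311257, 1311258, 1311259, 1311260],
     'EquipmentDescription': ['Lifting table', 'Roller bed', 'Lifting table',
                              'Roller bed', 'Roller bed'],
     'Equipment': ['LA120', 'RB2200', 'LT2202', 'RB2202', 'RB2204']})


def matching_labels(name):
    """Return the df1 index labels whose TagName contains `name` literally."""
    mask = df1['TagName'].str.contains(name, regex=False)
    return df1.index[mask.to_numpy()].tolist()


# Step 1: for each equipment, store the list of matching df1 index labels.
matched = df2.assign(df1_index=df2['Equipment'].apply(matching_labels))

# Step 2: flatten to one row per (equipment, label); empty lists become NaN and are dropped.
flat = (matched.explode('df1_index')
               .dropna(subset=['df1_index'])
               .astype({'df1_index': 'int64'}))

# Step 3: join the flattened rows back onto df1 by index label.
result = df1.join(
    flat.set_index('df1_index')[['EquipmentDescription', 'EquipmentNo']],
    how='inner')
print(result)
```

This loops over the 1600 equipment names rather than the 180k tags, so each iteration is one vectorized substring search over df1.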
Initializing the provided DataFrames:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
                    ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
                    ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
                   columns=['Line', 'TagName', 'CLASS'],
                   index=[187877, 187878, 187879, 187880, 187881])
df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
                    [1311257, 'Roller bed', 'RB2200'],
                    [1311258, 'Lifting table', 'LT2202'],
                    [1311259, 'Roller bed', 'RB2202'],
                    [1311260, 'Roller bed', 'RB2204']],
                   columns=['EquipmentNo', 'EquipmentDescription', 'Equipment'])
I suggest the following:
# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])
# add the columns 'EquipmentDescription' and 'EquipmentNo' as empty placeholders
# (None gives an object column that can hold strings; NaN gives a float column)
df3['EquipmentDescription'] = None
df3['EquipmentNo'] = np.nan
# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
    for index_df2, row_df2 in df2.iterrows():
        # check if 'Equipment' is in 'TagName'
        if df2.loc[index_df2, 'Equipment'] in df3.loc[index_df3, 'TagName']:
            # set 'EquipmentDescription' and 'EquipmentNo'
            df3.loc[index_df3, 'EquipmentDescription'] = df2.loc[index_df2, 'EquipmentDescription']
            df3.loc[index_df3, 'EquipmentNo'] = df2.loc[index_df2, 'EquipmentNo']
# convert 'EquipmentNo' to a nullable integer type, so rows with no match (still NaN) do not raise
df3['EquipmentNo'] = df3['EquipmentNo'].astype('Int64')
This produces the following DataFrame:
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
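With 180k rows in df1 and 1600 in df2, the nested loop above performs roughly 288 million comparisons. A possible vectorized alternative, sketched here under the assumption that each tag contains at most one equipment name of interest, builds a single regular expression from the escaped equipment names, extracts the first match per tag with str.extract, and then merges on it. This is a swapped-in technique, not part of the answer above.

```python
import re

import pandas as pd

df1 = pd.DataFrame(
    {'Line': ['PT_WOA'] * 5,
     'TagName': ['.ZS01_LA120_T05.SB.S2384_LesSwL',
                 '.ZS01_RB2202_T05.SB.S2385_FLOK',
                 '.ZS01_LA120_T05.SB._CBAbsHy',
                 '.ZS01_LA120_T05.SB.S3110_CBAPV',
                 '.ZS01_LARB2204.SB.S3111_CBRelHy']},
    index=[187877, 187878, 187879, 187880, 187881])
df2 = pd.DataFrame(
    {'EquipmentNo': [1311256, 1311257, 1311258, 1311259, 1311260],
     'EquipmentDescription': ['Lifting table', 'Roller bed', 'Lifting table',
                              'Roller bed', 'Roller bed'],
     'Equipment': ['LA120', 'RB2200', 'LT2202', 'RB2202', 'RB2204']})

# Longest names first, so that at a given position the alternation prefers
# the most specific name over one that is merely its prefix.
names = sorted(df2['Equipment'].dropna().astype(str).unique(), key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(name) for name in names) + ')'

df3 = df1.copy()
# expand=False returns a Series with the first match, or NaN when nothing matches.
df3['Equipment'] = df3['TagName'].str.extract(pattern, expand=False)

# merge() discards the left index, so carry it through as a column and restore it.
df3 = (df3.reset_index()
          .merge(df2, on='Equipment', how='left')
          .set_index('index')
          .rename_axis(None))
print(df3)
```

The left merge keeps tags with no matching equipment (their equipment columns are NaN), which makes unmatched rows easy to inspect afterwards.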
Let me know if this helps.
Given df1 and df2 as follows:
df1
| | Line | TagName | CLASS |
|---:|:-------|:--------------------------------|--------:|
| 0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | 10 |
| 1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK | 10 |
| 2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy | 10 |
| 3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV | 10 |
| 4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | 10 |
df2
| | EquipmentNo | EquipmentDescription | Equipment |
|---:|--------------:|:-----------------------|:------------|
| 0 | 1311256 | Lifting table | LA120 |
| 1 | 1311257 | Roller bed | RB2200 |
| 2 | 1311258 | Lifting table | LT2202 |
| 3 | 1311259 | Roller bed | RB2202 |
| 4 | 1311260 | Roller bed | RB2204 |
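- Get the unique Equipment values from df2
equipment = df2.Equipment.unique().tolist()
- Find the Equipment in each TagName
df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))
- Merge into the final form
- If you do not want the Equipment column in df_final, add .drop(columns=['Equipment']) to the end of the next line of code
df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')
df_final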
| | Line | TagName | Equipment | EquipmentNo | EquipmentDescription |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
| 0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120 | 1311256 | Lifting table |
| 1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy | LA120 | 1311256 | Lifting table |
| 2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV | LA120 | 1311256 | Lifting table |
| 3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK | RB2202 | 1311259 | Roller bed |
| 4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204 | 1311260 | Roller bed |
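One caveat with ''.join(...): if a tag happens to contain more than one equipment name, the names are concatenated into a string that matches nothing in df2, so that row silently drops out of the inner merge. A minimal sketch of the behavior and one way around it, using a made-up tag and a hypothetical helper first_match that are not from the original answer:

```python
equipment = ['LA120', 'RB2200', 'LT2202', 'RB2202', 'RB2204']

# Hypothetical tag that mentions two equipment names.
tag = '.ZS01_LA120_RB2204.SB.S3111_CBRelHy'

# The join-based lookup glues both hits together.
joined = ''.join([part for part in equipment if part in tag])
print(joined)  # LA120RB2204


def first_match(tag, names):
    """Return the first name in `names` that occurs in `tag`, or None."""
    return next((name for name in names if name in tag), None)


print(first_match(tag, equipment))            # LA120
print(first_match('.ZS01_XX999', equipment))  # None
```

Which name should win when several match is a data question; taking the first in df2 order is only one possible rule.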