Python: categorize rows of a pandas dataframe when a full string from another pandas dataframe is contained in them
I want to use a dataframe to categorize parts. A simplified version of the problem:
import pandas as pd

# Part descriptions to be categorized
data = {'col1': ['engine', 'blue engine cover', 'spark plug',
                 'rear panel', 'black rear panel', 'blue engine']}
desc_df = pd.DataFrame(data=data)
# One column per category, each holding that category's keywords
catg = {'bodywork': ['engine cover', 'side panel', 'rear panel'],
        'underhood': ['engine', 'spark plug', 'oil filter'],
        'Glass': ['Windscreen', 'window', 'demister']}
catg_df = pd.DataFrame(data=catg)
catg_df
        Glass      bodywork   underhood
0  Windscreen  engine cover      engine
1      window    side panel  spark plug
2    demister    rear panel  oil filter
desc_df
                col1
0             engine
1  blue engine cover
2         spark plug
3         rear panel
4   black rear panel
5        blue engine
In the end, I want to get:
                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
The closest I have come is:
d=catg_df.apply('|'.join).to_dict()
desc_df['Category'] = desc_df['col1'].apply(lambda x : ''.join([z if pd.Series(x).str.contains(y).values else '' for z,y in d.items()]))
But I end up matching both 'engine' and 'engine cover' in the same string:
desc_df
                col1           Category
0             engine          underhood
1  blue engine cover  bodyworkunderhood
2         spark plug          underhood
3         rear panel           bodywork
4   black rear panel           bodywork
5        blue engine          underhood
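Is there some way to solve this, so that if it finds 'engine cover' first it uses that category and does not go on to match 'engine'?

You can solve this by iterating over a dictionary: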
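from collections import OrderedDict
# Map each category to a regex of its keywords, iterating the columns in reverse order
d = OrderedDict([(k, '|'.join(catg_df[k].tolist())) for k in catg_df.columns[::-1]])
for k, v in d.items():
    # Later categories overwrite earlier ones for rows that match both
    desc_df.loc[desc_df['col1'].str.contains(v), 'Category'] = k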
Result
print(desc_df)
                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
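Explanation
- For each item in the dictionary, check the str.contains condition against the regex value and assign the key to the 'Category' column.
- Use collections.OrderedDict to give the columns a priority. Because each assignment overwrites earlier ones, the category that is iterated last wins when a row matches more than one.
- In this case, you can reverse the iteration order of the columns when building d.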
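The loop above depends on the order in which the categories are visited. As an alternative, here is a minimal sketch of a different technique (not the answer's method): flatten the keywords into one regex, sorted longest first, so that 'engine cover' is tried before 'engine' at any given position. It uses only the standard library; the names keyword_to_category and categorize are hypothetical and chosen for illustration.

```python
import re

# Category table from the question (keyword lists per category).
catg = {
    'bodywork': ['engine cover', 'side panel', 'rear panel'],
    'underhood': ['engine', 'spark plug', 'oil filter'],
    'Glass': ['Windscreen', 'window', 'demister'],
}

# Flatten to keyword -> category (hypothetical helper mapping).
keyword_to_category = {kw: cat for cat, kws in catg.items() for kw in kws}

# Sort longest keywords first: regex alternation tries alternatives in order,
# so at a given position 'engine cover' is attempted before 'engine'.
ordered = sorted(keyword_to_category, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(kw) for kw in ordered))


def categorize(description):
    """Return the category of the leftmost keyword match, or None if nothing matches."""
    match = pattern.search(description)
    return keyword_to_category[match.group(0)] if match else None


descriptions = ['engine', 'blue engine cover', 'spark plug',
                'rear panel', 'black rear panel', 'blue engine']
categories = [categorize(d) for d in descriptions]
print(categories)
```

Assuming the desc_df from the question, the same function could be applied with desc_df['col1'].map(categorize). Note that the leftmost match still wins, so a description containing two unrelated keywords is categorized by whichever appears first.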
Using difflib to get the closest value, together with a lambda:
First, create the mapper:
from difflib import get_close_matches
mapper = {val:k for k, v in catg_df.to_dict('list').items() for val in v}
print(mapper)
So the mapper will look like this:
{'Windscreen': 'Glass',
'demister': 'Glass',
'engine': 'underhood',
'engine cover': 'bodywork',
'oil filter': 'underhood',
'rear panel': 'bodywork',
'side panel': 'bodywork',
'spark plug': 'underhood',
'window': 'Glass'}
Now, use a lambda and difflib to find the closest value:
# avoid calling mapper.keys() in lambda
keys = mapper.keys()
desc_df['Category'] = desc_df['col1'].apply(lambda row: mapper[get_close_matches(row, keys)[0]])
Result:
                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
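One caveat with the lambda above: get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), so indexing with [0] raises IndexError for a description that resembles none of the keywords. The following is a minimal sketch of a guarded variant, assuming a fallback value is acceptable; the function name closest_category and its default parameter are hypothetical.

```python
from difflib import get_close_matches

# Keyword -> category mapping, as printed by the mapper above.
mapper = {
    'Windscreen': 'Glass', 'demister': 'Glass', 'window': 'Glass',
    'engine': 'underhood', 'oil filter': 'underhood', 'spark plug': 'underhood',
    'engine cover': 'bodywork', 'rear panel': 'bodywork', 'side panel': 'bodywork',
}


def closest_category(description, mapper, default=None, cutoff=0.6):
    """Return the category of the closest keyword, or `default` if none is close enough."""
    # n=1 asks only for the single best candidate at or above the cutoff.
    matches = get_close_matches(description, list(mapper), n=1, cutoff=cutoff)
    return mapper[matches[0]] if matches else default


print(closest_category('blue engine cover', mapper))
print(closest_category('xyz', mapper))
```

With the desc_df from the question, this could be applied as desc_df['col1'].apply(lambda row: closest_category(row, mapper)). Rows with no sufficiently close keyword then receive the fallback value instead of raising an error.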