Python: search for strings in a column and classify them using dictionary keys
I imported a spreadsheet exported from LinkedIn that contains my connections, and I want to classify people's positions into levels. To do that, I created a dictionary containing the terms to look for at each position level. The first version of that dictionary is:
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
'1 - Director of': ['Director', 'Head'],
'2 - Manager': ['Manager', 'Administrador'],
'3 - Engenheiro': ['Engenheiro', 'Engineering'],
'4 - Consultor': ['Consultor', 'Consultant'],
'5 - Estagiário': ['Estagiário', 'Intern'],
'6 - Desempregado': ['Self-Employed', 'Autônomo'],
'7 - Professor': ['Professor', 'Researcher'] }
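I need code that reads each position in the spreadsheet, checks whether it contains any of these terms, and returns the corresponding key in a separate column.
Sample data from the DataFrame I am reading looks like this: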
sample = pd.Series(data = (['(blank)'], ['Estagiário'], ['Professor', 'Adjunto'],
['CEO', 'and', 'Founder'], ['Engenheiro', 'de', 'Produção'],
['Consultant'], ['Founder', 'and', 'CTO'],
['Intern'], ['Manager', 'Specialist'],
['Administrador', 'de', 'Novos', 'Negócios'],
['Administrador', 'de', 'Serviços']))
This returns:
0 [(blank)]
1 [Estagiário]
2 [Professor, Adjunto]
3 [CEO, and, Founder]
4 [Engenheiro, de, Produção]
5 [Consultant]
6 [Founder, and, CTO]
7 [Intern]
8 [Manager, Specialist]
9 [Administrador, de, Novos, Negócios]
10 [Administrador, de, Serviços]
dtype: object
So far I have written the following code:
import pandas as pd

plan = pd.read_excel('SpreadSheet Name.xlsx', sheet_name = 'Positions')

list0 = ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner']
list1 = ['Director', 'Head']
list2 = ['Manager', 'Administrador']
listgeral = [list0, list1, list2]

def in_list(list_to_search,terms_to_search):
    results = [item for item in list_to_search if item in terms_to_search]
    if len(results) > 0:
        # Hardcoded: only the first level is handled so far
        return '0 - CEO, Founder'
    else:
        pass

plan['PositionLevel'] = plan['Position'].str.split().apply(lambda x: in_list(x, listgeral[0]))
Actual output:
Position PositionLevel
0 '(blank)' None
1 'Estagiário' None
2 'Professor Adjunto' None
3 'CEO and Founder' '0 - CEO, Founder'
4 'Engenheiro de produção' None
5 'Consultant' None
6 'Founder and CTO' '0 - CEO, Founder'
7 'Intern' None
8 'Manager Specialist' None
9 'Administrador de Novos Negócios' None
Expected output:
Position PositionLevel
0 '(blank)' None
1 'Estagiário' '5 - Estagiário'
2 'Professor Adjunto' '7 - Professor'
3 'CEO and Founder' '0 - CEO, Founder'
4 'Engenheiro de produção' '3 - Engenheiro'
5 'Consultant' '4 - Consultor'
6 'Founder and CTO' '0 - CEO, Founder'
7 'Intern' '5 - Estagiário'
8 'Manager Specialist' '2 - Manager'
9 'Administrador de Novos Negócios' '2 - Manager'
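At first I planned to run that code for each list in my listgeral, but I could not get that to work. Then I came to think it would be better to apply this to one big dictionary, like the one at the start of the question, and return the key of the matching word.
I tried to adapt the following code for this: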
dictest = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
           '1 - Director of': ['Director', 'Head'],
           '2 - Manager': ['Manager', 'Administrador']}

def in_dic (x, dictest):
    for key in dictest:
        for elem in dictest[key]:
            if elem == x:
                return key
    return False
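Here in_dic('CEO', dictest) returns '0 - CEO, Founder' and, for example, in_dic('Banana', dictest) returns False.
But I could not move forward from there and apply the in_dic() function to my problem.
I would really appreciate anyone's help.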
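Thank you very much.

I took the liberty of refactoring your input a bit, but here is what I got (it may be a little over-engineered). In short, we use a library called jellyfish (pip3 install jellyfish; the code is taken from another answer) for fuzzy string matching, to match the positions in the Excel sheet against the terms in dicpositions and then map them to the categories in the same dict. Here are the imports and the matching function: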
import pandas as pd
import numpy as np
import jellyfish

# Function for fuzzy-matching strings
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0

    # Keep an eye out for "blank" values, they can be strings, e.g. "(blank)", or e.g. NaN values
    # (pd.isna also catches NaN cells that are not the very same object as np.nan)
    if pd.isna(x) or x == "(blank)":
        return "(blank)"

    # Find which string most closely matches our input and return it
    for current_string in list_strings:
        # Older jellyfish releases expose this function as jellyfish.jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string

    return best_match
Okay, here is your dict, which I converted to a long-format DataFrame for convenience:
# Translations between keywords and their category, as dict, as provided in question
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'],
                '2 - Manager': ['Manager', 'Administrador'],
                '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                '4 - Consultor': ['Consultor', 'Consultant'],
                '5 - Estagiário': ['Estagiário', 'Intern'],
                '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                '7 - Professor': ['Professor', 'Researcher'],
                'Not found': ["(blank)"]  # <-- I added this to deal with blank values
                }

# Let's expand the dict above to a DF, which makes for easier merging later
positions = []
aliases = []
for key, val in dicpositions.items():
    for v in val:
        positions.append(key)
        aliases.append(v)

# This will serve as our mapping table
lookup_table = pd.DataFrame({
    "position": positions,
    "alias": aliases
})
print(lookup_table)
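As an aside, the same kind of lookup table can probably be built without the explicit loops. This is only a minimal sketch, assuming pandas 0.25 or later (where Series.explode is available); the shortened dict is for illustration only:

```python
import pandas as pd

# Shortened version of the dict, for illustration only
dicpositions = {
    '0 - CEO, Founder': ['CEO', 'Founder'],
    '2 - Manager': ['Manager', 'Administrador'],
    'Not found': ['(blank)'],
}

# One row per (position, alias) pair: explode() unpacks each list into rows,
# repeating the dict key (which is the index) for every alias in the list
lookup_table = (
    pd.Series(dicpositions)
    .explode()
    .rename_axis("position")
    .reset_index(name="alias")
)
print(lookup_table)
```

The loop version above is arguably easier to read; this variant just keeps the dict-to-table step in a single expression.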
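Let's test some inputs to see how the matching works. We compare each input string against the strings in the alias column and return the value from the alias column that best matches the input (we will use that value again later to look up the category, i.e. the position):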
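The code for this step is not shown, so here is a minimal sketch of what it could look like. The column names test_position and best_match are taken from the output below. The shortened alias list and the difflib-based stand-in for get_closest_match are assumptions, added only so that the snippet runs on its own without jellyfish installed. With the jellyfish-based function and lookup_table["alias"] from above, only the DataFrame creation and the final assignment would be needed.

```python
import difflib

import pandas as pd

# Shortened alias list, for illustration; in the answer this is lookup_table["alias"]
aliases = ["CEO", "Founder", "Manager", "Estagiário", "Intern", "Consultant", "(blank)"]

def get_closest_match(x, list_strings):
    # Stand-in scorer (assumption): difflib similarity ratio instead of Jaro-Winkler
    if pd.isna(x) or x == "(blank)":
        return "(blank)"
    return max(list_strings, key=lambda s: difflib.SequenceMatcher(None, x, s).ratio())

test_df = pd.DataFrame({
    "test_position": ["(blank)", "Estagiário", "Consultant", "Intern", "Manager Specialist"]
})

# For every input string, store the alias that is most similar to it
test_df["best_match"] = test_df["test_position"].apply(
    lambda x: get_closest_match(x, aliases)
)
print(test_df)
```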
A new column has been added to our test_df, indicating which alias in the lookup table is most similar to our test_position input:
test_position best_match
0 (blank) (blank)
1 Estagiário Estagiário
2 Professor Adjunto Professor
3 CEO and Founder CEO
4 Engenheiro de produção Engenheiro
5 Consultant Consultant
6 Founder and CTO Founder
7 Intern Intern
8 Manager Specialist Manager
9 Administrador de Novos Negócios Administrador
To get the category, we simply merge the best_match column of our test data with the alias column of the lookup table:
result = test_df.merge(lookup_table, left_on="best_match", right_on="alias", how="left")
The result:
test_position best_match alias position
0 (blank) (blank) (blank) Not found
1 Estagiário Estagiário Estagiário 5 - Estagiário
2 Professor Adjunto Professor Professor 7 - Professor
3 CEO and Founder CEO CEO 0 - CEO, Founder
4 Engenheiro de produção Engenheiro Engenheiro 3 - Engenheiro
5 Consultant Consultant Consultant 4 - Consultor
6 Founder and CTO Founder Founder 0 - CEO, Founder
7 Intern Intern Intern 5 - Estagiário
8 Manager Specialist Manager Manager 2 - Manager
9 Administrador de Novos Negócios Administrador Administrador 2 - Manager
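If exact word matches are enough and no fuzzy matching is needed, the in_dic idea from the question can be turned around: invert the dict once so that each term points to its key, then look up every word of a position. This is a minimal sketch under that assumption. The names classify_position and term_to_level are hypothetical, the dict is shortened, and the matching is made case-insensitive, which the question does not ask for explicitly:

```python
dicpositions = {
    '0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
    '2 - Manager': ['Manager', 'Administrador'],
    '5 - Estagiário': ['Estagiário', 'Intern'],
}

# Invert the dict: every (lower-cased) term maps to the key of its level
term_to_level = {
    term.lower(): level
    for level, terms in dicpositions.items()
    for term in terms
}

def classify_position(position):
    """Return the level of the first word that is a known term, else None."""
    if not isinstance(position, str):
        return None  # e.g. NaN cells read from a spreadsheet
    for word in position.split():
        level = term_to_level.get(word.lower())
        if level is not None:
            return level
    return None

for title in ['CEO and Founder', 'Administrador de Novos Negócios', 'Intern', '(blank)']:
    print(title, '->', classify_position(title))
```

On a DataFrame like the one in the question, this would presumably be applied as plan['Position'].apply(classify_position). Unlike the fuzzy approach, it will not tolerate typos or spelling variants that are missing from the dict.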
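Plasma, thank you very much! It works really well on my spreadsheet! I still need to refine it further and then do more testing. But for now it works great, and it is very scalable!! I really appreciate your help!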