Python: error when using str.contains inside a custom pandas function
I have a column containing many physician specialties. I want to clean it up, so I wrote the function below:
def specialty(x):
    if x.str.contains('Urolog'):
        return 'Urology'
    elif x.str.contains('Nurse'):
        return 'Nurse Practioner'
    elif x.str.contains('Oncology'):
        return 'Oncology'
    elif x.str.contains('Physician'):
        return 'Physician Assistant'
    elif x.str.contains('Family Medicine'):
        return 'Family Medicine'
    elif x.str.contains('Anesthes'):
        return 'Anesthesiology'
    else:
        return 'Other'

df['desc_clean'] = df['desc'].apply(specialty)
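But I get an error: TypeError: 'function' object is not subscriptable
There are too many distinct values to map by hand, which is why I want to use str.contains. Is there a better way?
Edit: sample DF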
{'person_id': {39063: 33081476009,
50538: 33033519093,
56075: 33170508793,
36593: 33061707789,
51656: 33047685345,
95512: 33022026049,
40286: 33038034707,
3887: 33076466195,
40161: 33052807819,
52905: 33190526939,
35418: 33008425164,
35934: 33015737122,
3389: 33055125864,
136: 33139641318,
105460: 33113871389,
52568: 33075745388,
24725: 33052090907,
34838: 33205449839,
31908: 33183672635,
36115: 33006692696},
'final_desc': {39063: 'None',
50538: 'Urology',
56075: 'Anesthesiology',
36593: 'None',
51656: 'Urology',
95512: 'None',
40286: 'Anesthesiology',
3887: 'Specialist',
40161: 'None',
52905: 'Anesthesiology',
35418: 'Urology',
35934: 'None',
3389: 'Ophthalmology',
136: 'Rheumatology',
105460: 'None',
52568: 'Urology',
24725: 'Family Medicine',
34838: 'None',
31908: 'Nurse Practitioner',
36115: 'None'}}
You can select the matching rows by index and assign to them directly:
ix = df[df.desc.str.contains('Urolog')].index
df.loc[ix, 'desc_clean'] = "Urology"
So the loop over all specialties would look something like this:
dict_specialties = {"Urolog": "Urology",}
for key, val in dict_specialties.items():
    ix = df[df.desc.str.contains(key)].index
    df.loc[ix, 'desc_clean'] = val
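A slightly fuller sketch of the same loop, for illustration only. The sample frame is made up, and three details are assumptions that go beyond the answer above: a boolean mask is used in place of the intermediate index, `na=False` treats missing values as non-matches, and `regex=False` makes each key a literal substring. The label spellings follow the question.

```python
import pandas as pd

# Hypothetical sample data; the column name `desc` follows the question.
df = pd.DataFrame(
    {"desc": ["Urologist", "Nurse (RN)", "Anesthesiology", None, "Dentist"]}
)

dict_specialties = {
    "Urolog": "Urology",
    "Nurse": "Nurse Practioner",  # spelling as in the question
    "Anesthes": "Anesthesiology",
}

# Start every row at the fallback label, then overwrite the rows that match.
df["desc_clean"] = "Other"
for key, val in dict_specialties.items():
    # na=False: missing values count as "no match" instead of producing NaN.
    # regex=False: treat the key as a plain substring, not a regular expression.
    mask = df["desc"].str.contains(key, regex=False, na=False)
    df.loc[mask, "desc_clean"] = val
```

If a value matches more than one key, the key that comes later in the dictionary wins, because each pass overwrites the previous assignment.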
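To do this, we can define a mapping from each search pattern to its clean label, then loop over the mapping and set the matching entries of the column, keeping track of which entries have been changed. Finally, every entry that never matched is set to "Other".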
import pandas as pd

mapping = {'Urolog': 'Urology',
           'Nurse': 'Nurse Practioner',
           'Oncology': 'Oncology',
           'Physician': 'Physician Assistant',
           'Family Medicine': 'Family Medicine',
           'Anesthes': 'Anesthesiology'}

def specialty(column):
    column = column.copy()  # work on a copy so the caller's column is untouched
    matches = pd.Series(False, index=column.index)  # entries matched so far
    for k, v in mapping.items():
        match = column.str.contains(k)
        column[match] = v
        matches[match] = True
    column[~matches] = 'Other'  # everything that never matched
    return column

specialty(df['final_desc'])
39063 Other
50538 Urology
56075 Anesthesiology
36593 Other
51656 Urology
95512 Other
40286 Anesthesiology
3887 Other
40161 Other
52905 Anesthesiology
35418 Urology
35934 Other
3389 Other
136 Other
105460 Other
52568 Urology
24725 Family Medicine
34838 Other
31908 Nurse Practioner
36115 Other
Name: final_desc, dtype: object
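The x that the specialty function receives is the string itself. There is no x.str, because x is already a plain string, so you can check with "in" instead. I modified some of the data to show the result.
Tip: use a dictionary or a list rather than an elif chain.
Output: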
person_id final_desc desc_clean
39063 33081476009 None Other
50538 33033519093 Urolog Urology
56075 33170508793 Anesthes Anesthesiology
36593 33061707789 None Other
51656 33047685345 Urology Other
95512 33022026049 None Other
40286 33038034707 Anesthes Anesthesiology
3887 33076466195 Specialist Other
40161 33052807819 None Other
52905 33190526939 Anesthesiology Other
35418 33008425164 Urology Other
35934 33015737122 None Other
3389 33055125864 Ophthalmology Other
136 33139641318 Rheumatology Other
105460 33113871389 None Other
52568 33075745388 Urology Other
24725 33052090907 Family Medicine Family Medicine
34838 33205449839 None Other
31908 33183672635 Nurse Nurse Practioner
36115 33006692696 None Other
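The code for this answer is not preserved here. As a minimal sketch of the idea it describes (substring checks with `in`, driven by a dictionary instead of an elif chain), something like the following could work. The names `MAPPING` and `default` are illustrative assumptions, the label spellings follow the question, and because this version tests substrings, its labels may differ from the table above.

```python
# Pattern -> clean label; checked in insertion order, first match wins.
MAPPING = {
    "Urolog": "Urology",
    "Nurse": "Nurse Practioner",  # spelling as in the question
    "Oncology": "Oncology",
    "Physician": "Physician Assistant",
    "Family Medicine": "Family Medicine",
    "Anesthes": "Anesthesiology",
}

def specialty(text, mapping=MAPPING, default="Other"):
    # Series.apply passes each cell on its own, so `text` is a plain str
    # (or a missing value such as None/NaN), not a Series with a .str accessor.
    if not isinstance(text, str):
        return default
    for pattern, label in mapping.items():
        if pattern in text:  # plain substring test
            return label
    return default
```

It would be applied element-wise, for example `df['desc_clean'] = df['final_desc'].apply(specialty)`.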
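You can use a library such as fuzzywuzzy for fuzzy string matching. The advantage of this approach is that it is more flexible than a fixed set of rules, as shown below.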
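This solution computes a partial (substring) match score between the text and each candidate category and returns the category with the highest score. If that score is below the threshold, a default value ("None") is returned:
Result: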
person_id final_desc desc
52568 33075745388 Urology urologist
36593 33061707789 Nurse Practioner nruse practition
136 33139641318 Specialist oncology specialist
50538 33033519093 Physician Assistant physicians assistant
3389 33055125864 Family Medicine fam. medicine
51656 33047685345 Anesthesiology anesthesiology
35418 33008425164 Anesthesiology anesthesiologist
52905 33190526939 Nurse Practioner Nurses practitioner
36115 33006692696 Specialist Occupational specialist
31908 33183672635 Oncology Oncologist
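Can you provide a sample of the dataframe? Perhaps df.sample(n=20).to_dict() or something similar.
Added! Thanks.
This looks like the output. What does the input text/column look like?
Added a fuzzy-matching solution that you might be interested in.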
from fuzzywuzzy import fuzz

CATEGORIES = [
    'Urology',
    'Nurse Practioner',
    'Oncology',
    'Physician Assistant',
    'Family Medicine',
    'Anesthesiology',
    'Specialist',
]

def best_match(
    text,
    categories=CATEGORIES,
    default="None",
    threshold=65
):
    matches = {fuzz.partial_ratio(cat, text): cat for cat in categories}
    best_score = max(matches)
    best_match = matches[best_score]
    if best_score >= threshold:
        return best_match
    else:
        return default

df["final_desc"] = df.desc.apply(best_match)