Python: str.contains raises an error inside a custom pandas function

I have a column that contains many different doctor specialties. I want to clean it up, so I wrote the function below:

def specialty(x):
    if x.str.contains('Urolog'):
        return 'Urology'
    elif x.str.contains('Nurse'):
        return 'Nurse Practioner'
    elif x.str.contains('Oncology'):
        return 'Oncology'
    elif x.str.contains('Physician'):
        return 'Physician Assistant'
    elif x.str.contains('Family Medicine'):
        return 'Family Medicine'
    elif x.str.contains('Anesthes'):
        return 'Anesthesiology'
    else:
        return 'Other'

df['desc_clean'] = df['desc'].apply(specialty)

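But I get an error:

TypeError: 'function' object is not subscriptable

There are too many distinct values to map by hand, which is why I want to use str.contains. Is there a better way?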

Edit: sample DataFrame

{'person_id': {39063: 33081476009,
  50538: 33033519093,
  56075: 33170508793,
  36593: 33061707789,
  51656: 33047685345,
  95512: 33022026049,
  40286: 33038034707,
  3887: 33076466195,
  40161: 33052807819,
  52905: 33190526939,
  35418: 33008425164,
  35934: 33015737122,
  3389: 33055125864,
  136: 33139641318,
  105460: 33113871389,
  52568: 33075745388,
  24725: 33052090907,
  34838: 33205449839,
  31908: 33183672635,
  36115: 33006692696},
 'final_desc': {39063: 'None',
  50538: 'Urology',
  56075: 'Anesthesiology',
  36593: 'None',
  51656: 'Urology',
  95512: 'None',
  40286: 'Anesthesiology',
  3887: 'Specialist',
  40161: 'None',
  52905: 'Anesthesiology',
  35418: 'Urology',
  35934: 'None',
  3389: 'Ophthalmology',
  136: 'Rheumatology',
  105460: 'None',
  52568: 'Urology',
  24725: 'Family Medicine',
  34838: 'None',
  31908: 'Nurse Practitioner',
  36115: 'None'}}

You can select the matching rows by their index and assign to them directly:

ix = df[df.desc.str.contains('Urolog')].index
df.loc[ix, 'desc_clean'] = "Urology"

So the loop would look something like this:

dict_specialties = {"Urolog":"Urology",}
for key, val in dict_specialties.items():
  ix = df[df.desc.str.contains(key)].index
  df.loc[ix, 'desc_clean'] = val
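
A loop like the one above only writes to rows that match, so anything unmatched stays NaN. The following is a minimal self-contained sketch of the same idea that also fills in a default label. The sample `desc` values and the three dictionary entries are made up for illustration, and it passes a boolean mask straight to `.loc` instead of building an index first.

```python
import pandas as pd

# Hypothetical sample data; the real column has many more variants.
df = pd.DataFrame(
    {"desc": ["Urologist", "Pediatric Oncology", "Anesthesia", None, "Dentist"]}
)

# Substring to search for -> cleaned label (illustrative subset).
dict_specialties = {
    "Urolog": "Urology",
    "Oncology": "Oncology",
    "Anesthes": "Anesthesiology",
}

# Start every row at the default, then overwrite the rows that match.
df["desc_clean"] = "Other"
for key, val in dict_specialties.items():
    # na=False counts missing descriptions as non-matches;
    # regex=False treats the key as a literal substring.
    mask = df["desc"].str.contains(key, na=False, regex=False)
    df.loc[mask, "desc_clean"] = val
```

If two keys match the same row, the key that comes later in the dictionary wins, because its assignment runs last.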

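To do this, we can define a mapping from each substring to its label, then loop over the entries and set the column's values, keeping track of which rows have been changed. At the end, any row that never matched is set to "Other".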

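import pandas as pd

# Substring to look for -> label to assign
mapping = {'Urolog': 'Urology',
 'Nurse': 'Nurse Practioner',
 'Oncology': 'Oncology',
 'Physician': 'Physician Assistant',
 'Family Medicine': 'Family Medicine',
 'Anesthes': 'Anesthesiology'}

def specialty(column):
    # Work on a copy so the caller's data is left untouched
    column = column.copy()
    # Tracks which rows have matched at least one key
    matches = pd.Series(False, index=column.index)
    for k, v in mapping.items():
        # na=False: missing values count as non-matches
        match = column.str.contains(k, na=False)
        column[match] = v
        matches[match] = True
    # Rows that never matched get the default label
    column[~matches] = 'Other'
    return column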


specialty(df['final_desc'])

39063                Other
50538              Urology
56075       Anesthesiology
36593                Other
51656              Urology
95512                Other
40286       Anesthesiology
3887                 Other
40161                Other
52905       Anesthesiology
35418              Urology
35934                Other
3389                 Other
136                  Other
105460               Other
52568              Urology
24725      Family Medicine
34838                Other
31908     Nurse Practioner
36115                Other
Name: final_desc, dtype: object
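
A vectorized variant of the same mapping idea can be sketched with `numpy.select`. This is not part of the answer above: the helper name `specialty_select` and the shortened mapping are assumptions made for illustration. `numpy.select` checks every condition against the original input and, where several conditions are true, takes the first one in list order.

```python
import numpy as np
import pandas as pd

# Illustrative subset of the substring -> label mapping.
mapping = {
    "Urolog": "Urology",
    "Oncology": "Oncology",
    "Anesthes": "Anesthesiology",
}


def specialty_select(column, mapping=mapping, default="Other"):
    """Label each value by the first mapping key it contains (hypothetical helper)."""
    # One boolean array per key, all computed from the untouched input column.
    conditions = [
        column.str.contains(key, na=False, regex=False).to_numpy(dtype=bool)
        for key in mapping
    ]
    choices = list(mapping.values())
    labels = np.select(conditions, choices, default=default)
    return pd.Series(labels, index=column.index, name=column.name)


sample = pd.Series(
    ["Urology", "None", "Anesthesiology", "Specialist"],
    index=[50538, 39063, 56075, 3887],
    name="final_desc",
)
result = specialty_select(sample)
```

Because the order of the dictionary decides which key wins, more specific substrings should be listed before more general ones.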

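The x that the specialty function receives is the string itself, because apply calls the function once per value. A plain string has no x.str accessor, so you can check with "in" instead, as shown below. I modified some of the data to make the result visible. Tip: use a dictionary or a list rather than an elif chain.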

Code:
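
The code for this answer is not included here. What follows is a minimal sketch of the approach the answer describes, not the answer's own code: a per-value function that uses the `in` operator with a dictionary instead of an elif chain. The name `SPECIALTIES`, the sample `desc` values and the missing-value check are assumptions. A substring test like this labels "Urology" as Urology, whereas the output listed for this answer shows such rows as Other, so the answer's own code presumably compared whole values.

```python
import pandas as pd

# Substring -> label pairs, checked in insertion order; the first match wins.
SPECIALTIES = {
    "Urolog": "Urology",
    "Oncology": "Oncology",
    "Family Medicine": "Family Medicine",
    "Anesthes": "Anesthesiology",
}


def specialty(x, default="Other"):
    # Series.apply passes one cell at a time, so x is a plain str
    # (or a missing value such as None/NaN), not a Series.
    if not isinstance(x, str):
        return default
    for key, label in SPECIALTIES.items():
        if key in x:  # ordinary substring test on a Python string
            return label
    return default


# Hypothetical sample data for illustration.
df = pd.DataFrame({"desc": ["Urologist", "Family Medicine", None, "Dentist"]})
df["desc_clean"] = df["desc"].apply(specialty)
```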

Output:

          person_id       final_desc        desc_clean
39063   33081476009             None             Other
50538   33033519093           Urolog           Urology
56075   33170508793         Anesthes    Anesthesiology
36593   33061707789             None             Other
51656   33047685345          Urology             Other
95512   33022026049             None             Other
40286   33038034707         Anesthes    Anesthesiology
3887    33076466195       Specialist             Other
40161   33052807819             None             Other
52905   33190526939   Anesthesiology             Other
35418   33008425164          Urology             Other
35934   33015737122             None             Other
3389    33055125864    Ophthalmology             Other
136     33139641318     Rheumatology             Other
105460  33113871389             None             Other
52568   33075745388          Urology             Other
24725   33052090907  Family Medicine   Family Medicine
34838   33205449839             None             Other
31908   33183672635            Nurse  Nurse Practioner
36115   33006692696             None             Other

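You can use a library such as fuzzywuzzy for fuzzy string matching. The advantage of this approach is that it is more flexible than a fixed set of rules.

This solution computes a partial (substring) match score between the text and each candidate category and returns the best-scoring category. If the best score is below the threshold, it returns the default value ("None").

Result: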

         person_id           final_desc                     desc
52568  33075745388              Urology                urologist
36593  33061707789     Nurse Practioner         nruse practition
136    33139641318           Specialist      oncology specialist
50538  33033519093  Physician Assistant    physicians assistant
3389   33055125864      Family Medicine            fam. medicine
51656  33047685345       Anesthesiology           anesthesiology
35418  33008425164       Anesthesiology         anesthesiologist
52905  33190526939     Nurse Practioner      Nurses practitioner
36115  33006692696           Specialist  Occupational specialist
31908  33183672635             Oncology               Oncologist

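Can you provide a sample of the DataFrame? Maybe df.sample(n=20).to_dict() or something similar.
Added! Thanks.
This looks like the output. What about the input text/column?
Added a fuzzy-matching solution that you might be interested in.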

from fuzzywuzzy import fuzz

CATEGORIES = [
 'Urology',
 'Nurse Practioner',
 'Oncology',
 'Physician Assistant',
 'Family Medicine',
 'Anesthesiology',
 'Specialist',
]    


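def best_match(
    text,
    categories=CATEGORIES,
    default="None",
    threshold=65
):
    # Map each score (0-100) to its category; categories that tie on the
    # same score overwrite one another, so only the last one is kept.
    matches = {fuzz.partial_ratio(cat, text): cat for cat in categories}
    best_score = max(matches)
    best_category = matches[best_score]
    # Fall back to the default when even the best score is too low.
    if best_score >= threshold:
        return best_category
    else:
        return default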


df["final_desc"] = df.desc.apply(best_match)
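
If installing fuzzywuzzy is not an option, the same "best score above a threshold" idea can be sketched with `difflib` from the standard library. This is a stand-in, not an equivalent: `SequenceMatcher.ratio()` compares the whole strings and returns a value between 0 and 1, so both the scores and a sensible threshold differ from `fuzz.partial_ratio`. The function name `best_match_difflib`, the shortened category list and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher

# Illustrative subset of the candidate categories.
CATEGORIES = [
    "Urology",
    "Oncology",
    "Family Medicine",
    "Anesthesiology",
    "Specialist",
]


def best_match_difflib(text, categories=CATEGORIES, default="None", threshold=0.6):
    """Return the most similar category, or default if nothing is close enough."""
    # Compare case-insensitively; ratio() is a float in [0, 1].
    scored = [
        (SequenceMatcher(None, cat.lower(), str(text).lower()).ratio(), cat)
        for cat in categories
    ]
    # max() over (score, category) tuples picks the highest score and
    # keeps every candidate, so tied scores do not overwrite each other.
    best_score, best_category = max(scored)
    return best_category if best_score >= threshold else default
```

It can be applied to a column in the same way, for example `df.desc.apply(best_match_difflib)`.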
