How to write a function to reduce repetitive code in Python
I am trying to extract drug names from the df['Text'] series when a suffix in df['Stem'] matches the suffix of a drug name.
print (df['Text'])
Text
1/1/11 (USA) neoadjuvant arimidex
1/2/11 radafaxine + cisplatin.
1/3/11 abc letrozole
print (df['Stem'])
Stem
dex
zole
platin
axine
amivir
arit
The expected output is:
Drugs
arimidex
radafaxine, cisplatin
letrozole
Here is what I did to extract the names and build the new "Drugs" series:
df['dex'] = df['Text'].str.extract(r"(\w+dex)", expand=False)
df['platin'] = df['Text'].str.extract(r"(\w+platin)", expand=False)
df['xine'] = df['Text'].str.extract(r"(\w+xine)", expand=False)
df['zole'] = df['Text'].str.extract(r"(\w+zole)", expand=False)
df['Drugs'] = df[df.columns[2:6]].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
df = df.drop(df.columns[2:6], axis=1)
df
Text Stem Drugs
1/1/11 (USA) neoadjuvant arimidex dex arimidex
1/2/11 radafaxine + cisplatin. zole radafaxine, cisplatin
1/3/11 abc letrozole platin letrozole
NaN axine NaN
NaN amivir NaN
NaN arit NaN
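However, this is repetitive. I would like to write a function that goes through "Text", matches the drug names that contain one of the suffixes, and extracts them. Is there a way to do this? Thanks in advance.
Update:
As MaxU suggested, I created a new DataFrame that resembles the original data: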
print (df['Text_Long'])
Text_Long
2/1/14 (JK) DOCETAXEL, PYPHAMIDE
2/10/14 (JK) NITROZOLE
2/12/14-4/15/14 30MV PHOTONS TO LT arm, JC/WE 500JC IN 25OP
2/22/12 (Kansas/HEM)- NEOADJUVANT KITOTERE, DRYMYCIN, KITOXAN
4/11/11-11/24/11 (JK) CYCLOPHOSPHAMIDE, FLUOROURACIL
4/14/14 (CONN) GEMZAR + OPR. 11/25/14 (CONN) OPR.
4/12/12-10/2/12-KT-RIGHT ARM-5020 NYG, 24 PRESSURE
JK DRUG therapy: aritrozole
NITROZOLE STARTED ON 1/11/12 PER ADVICE
KFC X 2
maritinib & fosclitaxel.
Urioxifen
10/2/12 NEOADJUVANT FLOMIDEX
10/29/12 YUMYCIN, KITOXAN, TACXOL
11/11/14 (JK) GOODZOLE
2/12/12 (CONN) petbine + pastlatin.
4/2014 (CONN) Continue PSCORE for 2 cycles.
2/2015 to 5/2015 OSF (Stinson) XRT
5/19/10-2/21/10 HEMYCIN AND BASKIXAN
5/2/12-5/12/12 1000NY/20FL/30MT/OT A2-A9
2/2015 OPC (JK) DRUG THERAPY
print (stem)
The list of suffixes is in an Excel file on GitHub.
Thanks again for your help and suggestions.

Assuming you have the following DF:
In [92]: drugs_stem
Out[93]:
Stem
0 dex
1 zole
2 platin
3 axine
4 amivir
5 arit
and the df from the question, you can do the following (re must be imported for the re.I flag):
In [94]: pat = r'\b(\w*(?:{})\w*)\b'.format(drugs_stem.Stem.str.cat(sep='|'))
In [95]: df['Drugs'] = df.Text.str.extractall(pat, flags=re.I).unstack() \
.apply(lambda x:', '.join(x.dropna()), axis=1)
In [96]: df
Out[96]:
Text Drugs
0 1/1/11 (USA) neoadjuvant arimidex arimidex
1 1/2/11 radafaxine + cisplatin. radafaxine, cisplatin
2 1/3/11 abc letrozole letrozole
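
Since the question asks for a function, the same idea can be wrapped in one. The following is only a minimal sketch, not the answer author's code: the name extract_by_stems and its sep parameter are made up for illustration, it uses str.findall instead of extractall/unstack, and it escapes each stem with re.escape on the assumption that stems are literal text rather than regex fragments.

```python
import re
import pandas as pd


def extract_by_stems(text, stems, sep=", "):
    """For each row of `text`, join the words that contain any of `stems`.

    Hypothetical helper: name and signature are illustrative only.
    """
    # Escape each stem so regex metacharacters are matched literally.
    alternation = "|".join(re.escape(s) for s in stems)
    pat = r"\b(\w*(?:{})\w*)\b".format(alternation)
    # One capture group, so findall yields a list of matched words per row.
    found = text.str.findall(pat, flags=re.IGNORECASE)
    # Rows with no match (empty list) or missing text become None.
    return found.apply(
        lambda words: sep.join(words) if isinstance(words, list) and words else None
    )


# Sample data taken from the question, plus one row without any drug name.
df = pd.DataFrame({"Text": [
    "1/1/11 (USA) neoadjuvant arimidex",
    "1/2/11 radafaxine + cisplatin.",
    "1/3/11 abc letrozole",
    "KFC X 2",
]})
stems = ["dex", "zole", "platin", "axine", "amivir", "arit"]

result = extract_by_stems(df["Text"], stems)
df["Drugs"] = result
```

Adding a new suffix then only means appending it to stems; no extra column or extract call is needed.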
Update:
In [25]: %paste
drugs_stem = pd.Series(suffix)
pat = r'\b(\w*(?:{})\w*)\b'.format(drugs_stem.str.cat(sep='|'))
df['Drugs'] = df.Text_Long.str.lower().str.extractall(pat).unstack() \
.apply(lambda x:', '.join(x.dropna()), axis=1)
## -- End pasted text --
In [26]: df
Out[26]:
Text_Long Drugs
0 2/1/14 (JK) DOCETAXEL, PYPHAMIDE docetaxel
1 2/10/14 (JK) NITROZOLE nitrozole
2 2/12/14-4/15/14 30MV PHOTONS TO LT arm, JC/WE... NaN
3 2/22/12 (Kansas/HEM)- NEOADJUVANT KITOTERE, DR... drymycin, kitoxan
4 4/11/11-11/24/11 (JK) CYCLOPHOSPHAMIDE, FLUORO... fluorouracil
5 4/14/14 (CONN) GEMZAR + OPR. 11/25/14 (CONN... conn, conn
6 4/12/12-10/2/12-KT-RIGHT ARM-5020 NYG, 24 PRES... NaN
7 JK DRUG therapy: aritrozole aritrozole
8 NITROZOLE STARTED ON 1/11/12 PER ADVICE nitrozole, started
9 KFC X 2 NaN
10 maritinib & fosclitaxel. maritinib, fosclitaxel
11 Urioxifen urioxifen
12 10/2/12 NEOADJUVANT FLOMIDEX NaN
13 10/29/12 YUMYCIN, KITOXAN, TACXOL yumycin, kitoxan, tacxol
14 11/11/14 (JK) GOODZOLE NaN
15 2/12/12 (CONN) petbine + pastlatin. conn, pastlatin
16 4/2014 (CONN) Continue PSCORE for 2 cycles. conn, continue, pscore, for, cycles
17 2/2015 to 5/2015 OSF (Stinson) XRT NaN
18 5/19/10-2/21/10 HEMYCIN AND BASKIXAN hemycin
19 5/2/12-5/12/12 1000NY/20FL/30MT/OT A2-A9 NaN
20 2/2015 OPC (JK) DRUG THERAPY NaN
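Note: this solution was tested with Pandas 0.19.2; you may run into problems with Pandas versions < 0.19.0.

I get the following error message: AssertionError: 1 columns passed, passed data had 10 columns.
@comprocho, I'm afraid I need a reproducible sample data set to debug it.
I have updated the original post. Thanks for your help.
@comprocho, do you also have multi-line values?
@comprocho, is "KFC X 2" the entire value in the cell?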
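
One property of the pattern used in the answer is worth spelling out: because \w* follows the alternation, a stem is accepted anywhere inside a word, not only at its end. If only words that end in a stem should count, as the question describes, one option is to drop the trailing \w*. The sketch below compares the two; it assumes the six stems from the question (the full suffix list from the Excel file is not shown in the thread), and the second sample sentence is invented for illustration.

```python
import re
import pandas as pd

# Assumption: the six stems from the question, treated as literal text.
stems = ["dex", "zole", "platin", "axine", "amivir", "arit"]
alternation = "|".join(re.escape(s) for s in stems)

# Stem may appear anywhere in the word (same shape as the answer's pattern).
loose = r"\b\w*(?:{})\w*\b".format(alternation)
# Stem must end the word, with at least one character before it.
anchored = r"\b\w+(?:{})\b".format(alternation)

text = pd.Series([
    "1/2/11 radafaxine + cisplatin.",   # from the question
    "platinum based, then cisplatin",   # invented example
])

loose_hits = text.str.findall(loose, flags=re.IGNORECASE).tolist()
anchored_hits = text.str.findall(anchored, flags=re.IGNORECASE).tolist()

print(loose_hits)     # [['radafaxine', 'cisplatin'], ['platinum', 'cisplatin']]
print(anchored_hits)  # [['radafaxine', 'cisplatin'], ['cisplatin']]
```

The anchored form also stops matching stems that sit at the start or in the middle of a word (arit in aritrozole, for example, which here is still caught through zole), so whether it is the right choice depends on what the stem list is meant to contain.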