Python: after a re.findall search I get fewer or more rows than I should
I have unprocessed text from which I want to extract each patient's gender, but I end up with either fewer or more rows than I started with. How should I handle this kind of error?
Tags: python, regex
fil = data['transcription']
print(fil)
Output:
0 SUBJECTIVE:, This 23-year-old white female pr...
1 PAST MEDICAL HISTORY:, He has difficulty climb...
2 HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3 2-D M-MODE: , ,1. Left atrial enlargement wit...
4 1. The left ventricular cavity size and wall ...
...
4994 HISTORY:, I had the pleasure of meeting and e...
4995 ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...
4996 SUBJECTIVE: , This is a 42-year-old white fema...
4997 CHIEF COMPLAINT: , This 5-year-old male presen...
4998 HISTORY: , A 34-year-old male presents today s...
Name: transcription, Length: 4999, dtype: object
4967 or 5032 #it should be 4999 when i do print(len(gender_aux))
['female', 'male', 'male', ' ', 'male', 'male', 'male', 'male', 'male', ' ', 'male'...]
This is the code that extracts the gender from the text:
import re
gender_aux = []
for i in fil:
    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
        # pass
    gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
                   "female": ["lady", "female", "woman", "girl", 'she']}
    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux += [' ']
            break
print(len(gender_aux))
print(gender_aux)
If I remove the `or [" "]` or the `else` branch I get 4967 entries; otherwise I get 5032, when I should get 4999 in total.
I noticed that you neither convert the text with lower() nor pass flags=re.IGNORECASE, which may affect your final counts.
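To illustrate the point about case, here is a small sketch that runs the question's pattern on the second sample row three ways. The variable names are only for illustration.

```python
import re

PATTERN = "female|gentleman|woman|lady|man|male|girl|boy|she|he"
line = "PAST MEDICAL HISTORY:, He has difficulty climb..."

# Case-sensitive search: the capitalised "He" is not matched.
case_sensitive = re.findall(PATTERN, line)

# Either of these makes the search case-insensitive.
case_insensitive = re.findall(PATTERN, line, flags=re.IGNORECASE)
lowered = re.findall(PATTERN, line.lower())

print(case_sensitive)    # []
print(case_insensitive)  # ['He']
print(lowered)           # ['he']
```

Note that with re.IGNORECASE the matches keep their original casing, so a later lookup in a lowercase word list would still need g.lower().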
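The main problem is that when re.findall does not match any gender term in the string, your for loop never runs, so nothing is appended for that row. To avoid this, I check whether re.findall returned any match for the line and, if not, simply append a blank string.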
import pandas as pd
import re

text = pd.Series([
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",
    "2-D M-MODE: , ,1. Left atrial enlargement wit...",
    "1. The left ventricular cavity size and wall ...",
    "HISTORY:, I had the pleasure of meeting and e...",
    "ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",
    "SUBJECTIVE: , This is a 42-year-old white fema...",
    "CHIEF COMPLAINT: , This 5-year-old male presen...",
    "HISTORY: , A 34-year-old male presents today s..."
])

gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
               "female": ["lady", "female", "woman", "girl", 'she']}

gender_aux = []
for line in text:
    gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", line.lower())
    if len(gender):
        for g in gender:
            if g in gender_dict['male']:
                gender_aux.append('male')
                break
            elif g in gender_dict['female']:
                gender_aux.append('female')
                break
    else:  # no gender match
        gender_aux.append(' ')
print(len(gender_aux))
print(gender_aux)
Output:
10
['female', 'male', ' ', ' ', 'male', 'male', ' ', ' ', 'male', 'male']
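One thing visible in this output: the fifth and sixth rows come out as 'male' only because "he" matches inside "The" / "the". A possible refinement, sketched below, adds word boundaries (\b) and wraps the logic in a function that returns exactly one label per row, so the result always has the same length as the input. The function name extract_gender and the handling of non-string rows (for example NaN for a missing transcription) are assumptions for illustration, not something confirmed by the question's data.

```python
import re

# \b keeps "he" from matching inside "the" and "man" inside "woman".
GENDER_RE = re.compile(
    r"\b(?:(?P<female>female|woman|lady|girl|she)"
    r"|(?P<male>gentleman|man|male|boy|he))\b",
    flags=re.IGNORECASE,
)

def extract_gender(value):
    """Return 'male', 'female', or ' ' for one row (always exactly one label)."""
    if not isinstance(value, str):  # assumed case: NaN / missing transcription
        return " "
    match = GENDER_RE.search(value)  # first gender term in the text, if any
    if match is None:
        return " "
    return "female" if match.group("female") else "male"

rows = [
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "1. The left ventricular cavity size and wall ...",
    float("nan"),
]
labels = [extract_gender(r) for r in rows]
print(len(labels))  # 4
print(labels)       # ['female', 'male', ' ', ' ']
```

Because every row goes through a single return path, there is no way for a row to contribute zero or two entries. The label still reflects only the first gender term found, which may not refer to the patient.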
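Comments:
Please provide one.
I forgot to add: why not take advantage of the features Pandas offers for this?
@AMC Which features? 0.o
Pandas provides a number of regex/text manipulation methods. Couldn't you use named capture groups, or two different regexes, and simplify things considerably?
I'll give it a try.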
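As a rough sketch of what the comment about Pandas might look like in practice, Series.str.extract accepts a pattern with named capture groups and returns one column per group, with one row per input row. The pattern, the word boundaries, and the sample data below are assumptions for illustration; the commenter did not post code.

```python
import re
import pandas as pd

text = pd.Series([
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",
    None,  # a missing transcription
])

pattern = (
    r"\b(?:(?P<female>female|woman|lady|girl|she)"
    r"|(?P<male>gentleman|man|male|boy|he))\b"
)

# One row per input row; columns 'female' and 'male' hold the first match or NaN.
matches = text.str.extract(pattern, flags=re.IGNORECASE)

# Start with a blank label for every row, then fill in where a group matched.
gender = pd.Series(" ", index=text.index)
gender[matches["male"].notna()] = "male"
gender[matches["female"].notna()] = "female"

print(len(gender))      # 4
print(gender.tolist())  # ['female', 'male', ' ', ' ']
```

Since str.extract is index-aligned with the input, the result has the same length as the original Series by construction, and missing values simply produce no match.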