Python: after a re.findall search, I get fewer or more rows than I should

python regex

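I have unprocessed text from which I want to extract each patient's gender, but I end up with either fewer or more rows than I started with. How should I handle this kind of error?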

fil = data['transcription']
print(fil)

Output:

0       SUBJECTIVE:,  This 23-year-old white female pr...
1       PAST MEDICAL HISTORY:, He has difficulty climb...
2       HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3       2-D M-MODE: , ,1.  Left atrial enlargement wit...
4       1.  The left ventricular cavity size and wall ...
                              ...                        
4994    HISTORY:,  I had the pleasure of meeting and e...
4995    ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...
4996    SUBJECTIVE: , This is a 42-year-old white fema...
4997    CHIEF COMPLAINT: , This 5-year-old male presen...
4998    HISTORY: , A 34-year-old male presents today s...
Name: transcription, Length: 4999, dtype: object

4967 or 5032 #it should be 4999 when i do print(len(gender_aux))
['female', 'male', 'male', ' ', 'male', 'male', 'male', 'male', 'male', ' ', 'male'...]

This is the code that extracts the gender from the text:

import re

gender_aux = []
for i in fil:

    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
#         pass

    gender_dict = {"male": ["gentleman", "man", "male", "boy",'he'],
               "female": ["lady","female", "woman", "girl",'she']}

    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux+=[' ']
            break
print(len(gender_aux))            
print(gender_aux)

If I remove the or [" "] or the else branch I get 4967 entries; otherwise I get 5032, when I should actually get 4999 in total.

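I noticed that you are not converting the text with lower() or passing flags=re.IGNORECASE, which may affect your final counts.

The main problem is that when re.findall matches none of the gender words in a string, your for loop never runs, so nothing is appended for that row. To avoid this, I check whether re.findall found any matches on that line and, if it did not, simply append a blank string.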

import pandas as pd
import re

text = pd.Series([
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",
    "2-D M-MODE: , ,1. Left atrial enlargement wit...",
    "1. The left ventricular cavity size and wall ...",
    "HISTORY:, I had the pleasure of meeting and e...",
    "ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",
    "SUBJECTIVE: , This is a 42-year-old white fema...",
    "CHIEF COMPLAINT: , This 5-year-old male presen...",
    "HISTORY: , A 34-year-old male presents today s..."
])

gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
               "female": ["lady", "female", "woman", "girl", 'she']}

gender_aux = []
for line in text:
    gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", line.lower())
    if len(gender):
        for g in gender:
            if g in gender_dict['male']:
                gender_aux.append('male')
                break
            elif g in gender_dict['female']:
                gender_aux.append('female')
                break
    else:  # no gender match
        gender_aux.append(' ')

print(len(gender_aux))
print(gender_aux)
Output:

10
['female', 'male', ' ', ' ', 'male', 'male', ' ', ' ', 'male', 'male']
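Building on the approach above, here is a minimal sketch of a variant that appends exactly one label per row. It rests on two assumptions that are not confirmed by the question. First, some rows may not be strings (for example NaN for a missing transcription). In the original loop such a row would raise inside the try block, get a ' ' from the except branch, and then get a second entry from the for loop running over the previous row's matches, which is one possible source of the extra entries. Second, whole-word matching is what is wanted, so \b word boundaries are added. This changes the results compared with the code above, because "he" no longer matches inside "the". The names GENDER_RE and classify_gender are made up for illustration.

```python
import re

# Named groups record which side of the alternation matched.
# \b word boundaries stop "he" from matching inside "the" or "she".
GENDER_RE = re.compile(
    r"\b(?:(?P<female>female|woman|lady|girl|she)"
    r"|(?P<male>gentleman|man|male|boy|he))\b",
    re.IGNORECASE,
)


def classify_gender(value):
    """Return 'female', 'male', or ' ' for a single transcription value."""
    if not isinstance(value, str):  # assumed case: NaN / None for missing text
        return " "
    match = GENDER_RE.search(value)  # first gender word in the text wins
    if match is None:
        return " "
    return match.lastgroup  # name of the group that matched


rows = [
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "1.  The left ventricular cavity size and wall ...",
    float("nan"),
    "HISTORY: , A 34-year-old male presents today s...",
]

# One call per row, one label per call, so the lengths always agree.
gender_aux = [classify_gender(row) for row in rows]
print(len(gender_aux))
print(gender_aux)
```

Because every input goes through a function that always returns a single value, len(gender_aux) cannot drift away from the number of rows, whatever the regular expression matches.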

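Please provide one. I forgot to add: why not take advantage of the functionality Pandas provides for this?
@AMC Which functionality? 0.o
Pandas provides a number of regex/text manipulation methods. Couldn't you use named capture groups, or two different regular expressions, and simplify things considerably?
I'll give it a try.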
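The Pandas suggestion in the comments could look roughly like the sketch below. It is only one possible reading of that suggestion, not something the commenter posted. It uses Series.str.extract with two named capture groups, so the result has one row per input row by construction. The sample Series, the word boundaries, and the choice of ' ' for "no match" are assumptions made for illustration.

```python
import re

import pandas as pd

# Small illustrative sample; None stands in for a missing transcription.
text = pd.Series([
    "SUBJECTIVE:, This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "1.  The left ventricular cavity size and wall ...",
    None,
])

pattern = (
    r"\b(?:(?P<female>female|woman|lady|girl|she)"
    r"|(?P<male>gentleman|man|male|boy|he))\b"
)

# str.extract returns a DataFrame with one column per named group,
# filled from the first match in each row (NaN where nothing matched).
found = text.str.extract(pattern, flags=re.IGNORECASE)

# Start with the "no match" label, then overwrite where a group matched.
gender = pd.Series(" ", index=text.index, name="gender")
gender[found["female"].notna()] = "female"
gender[found["male"].notna()] = "male"

print(len(gender))
print(gender.tolist())
```

The result shares the index of the input, so it can be assigned back as a new column without any length mismatch.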