Python: getting fewer or more rows than expected after a re.findall search


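I have unprocessed text from which I want to extract the patient's gender, but I end up with either fewer or more rows than I should. How should I handle this kind of error?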

fil = data['transcription']
print(fil)
Output:

0       SUBJECTIVE:,  This 23-year-old white female pr...
1       PAST MEDICAL HISTORY:, He has difficulty climb...
2       HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3       2-D M-MODE: , ,1.  Left atrial enlargement wit...
4       1.  The left ventricular cavity size and wall ...
                              ...                        
4994    HISTORY:,  I had the pleasure of meeting and e...
4995    ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...
4996    SUBJECTIVE: , This is a 42-year-old white fema...
4997    CHIEF COMPLAINT: , This 5-year-old male presen...
4998    HISTORY: , A 34-year-old male presents today s...
Name: transcription, Length: 4999, dtype: object
This is the code that extracts the gender from the text:

import re

gender_aux = []
for i in fil:

    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
#         pass

    gender_dict = {"male": ["gentleman", "man", "male", "boy",'he'],
               "female": ["lady","female", "woman", "girl",'she']}

    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux+=[' ']
            break
print(len(gender_aux))            
print(gender_aux)
If I remove the or [" "] fallback I get 4967 entries; with it I get 5032, but I should get 4999 in total.

Output:

4967 or 5032 #it should be 4999 when i do print(len(gender_aux))
['female', 'male', 'male', ' ', 'male', 'male', 'male', 'male', 'male', ' ', 'male'...]

I noticed that you are not converting the text with lower() or passing flags=re.IGNORECASE, which may affect your final counts.
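To see what the case-sensitivity point means in practice, here is a small sketch that runs the pattern from the question on one of the sample lines. The variable names are only for illustration.

```python
import re

# Pattern and sample line are taken from the question; the variable
# names below are illustrative.
pattern = "female|gentleman|woman|lady|man|male|girl|boy|she|he"
line = "PAST MEDICAL HISTORY:, He has difficulty climb..."

case_sensitive = re.findall(pattern, line)                    # misses "He"
lowered = re.findall(pattern, line.lower())                   # matches "he"
ignore_case = re.findall(pattern, line, flags=re.IGNORECASE)  # matches "He"

print(case_sensitive)  # []
print(lowered)         # ['he']
print(ignore_case)     # ['He']
```

Note that with re.IGNORECASE the match keeps its original casing ('He'), so it would still need to be lowercased before being looked up in gender_dict.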

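The main problem is that when re.findall does not match any gender word in the string, your for loop never runs, so nothing is appended for that row. To avoid this, I check whether re.findall found any matches on that line and, if not, simply append the blank string.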

import pandas as pd
import re

text = pd.Series([
    "SUBJECTIVE:,  This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",
    "2-D M-MODE: , ,1.  Left atrial enlargement wit...",
    "1.  The left ventricular cavity size and wall ...",
    "HISTORY:,  I had the pleasure of meeting and e...",
    "ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",
    "SUBJECTIVE: , This is a 42-year-old white fema...",
    "CHIEF COMPLAINT: , This 5-year-old male presen...",
    "HISTORY: , A 34-year-old male presents today s..."
])

gender_dict = {"male": ["gentleman", "man", "male", "boy",'he'],
               "female": ["lady","female", "woman", "girl",'she']}

gender_aux = []
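for line in text:
    # Lowercase the line so that "He" or "Female" also match the pattern.
    gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", line.lower())

    if gender:
        # Label the row by its first match, then stop so each row adds one entry.
        for g in gender:
            if g in gender_dict['male']:
                gender_aux.append('male')
                break
            elif g in gender_dict['female']:
                gender_aux.append('female')
                break
    else:  # no gender match: append a blank so the row is still counted
        gender_aux.append(' ')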

print(len(gender_aux))
print(gender_aux)
Output:

10
['female', 'male', ' ', ' ', 'male', 'male', ' ', ' ', 'male', 'male']
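The count can also drift in the other direction. In the question's loop, the except branch appends a blank and execution then continues into the inner for loop, which still sees gender from the previous row, so a single row can add two entries. A rough sketch that sidesteps both cases by returning exactly one label per row is shown here. The helper name extract_gender, the sample rows, and the assumption that some rows are not strings (for example NaN for a missing transcription) are mine, not from the question.

```python
import re

# Word lists and pattern come from the question; the helper name and the
# sample rows are illustrative assumptions.
GENDER_WORDS = {
    "male": ["gentleman", "man", "male", "boy", "he"],
    "female": ["lady", "female", "woman", "girl", "she"],
}
PATTERN = re.compile("female|gentleman|woman|lady|man|male|girl|boy|she|he")


def extract_gender(text):
    """Return exactly one label for a row: 'male', 'female' or ' '."""
    if not isinstance(text, str):  # assumed case: NaN/None for a missing row
        return " "
    match = PATTERN.search(text.lower())  # first match only, like the loop's break
    if match is None:
        return " "
    return "male" if match.group(0) in GENDER_WORDS["male"] else "female"


rows = [
    "This 23-year-old white female pr...",
    float("nan"),  # assumed non-string row
    "2-D M-MODE: , ,1.",
    "He has difficulty climb...",
]
gender_aux = [extract_gender(row) for row in rows]

print(len(gender_aux) == len(rows))  # True
print(gender_aux)                    # ['female', ' ', ' ', 'male']
```

Because every row goes through one function call that always returns a value, the result has the same length as the input by construction. The sketch keeps the substring matching of the code above, so "he" inside "the" still counts as a match.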

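Please provide a […].
I forgot to add: why not take advantage of the features pandas offers for this?
@AMC Which features? 0.o
pandas provides a number of regex/text manipulation methods. Couldn't you use named capture groups, or two different regular expressions, and simplify things considerably?
I'll give it a try.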
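A sketch of what the pandas suggestion in the comments might look like, assuming Series.str.extract with two named capture groups. The \b word boundaries are my own addition so that "he" inside "The" no longer matches, and the sample Series (including the None row) is made up for illustration.

```python
import re

import pandas as pd

# Illustrative sample; the None row stands in for a missing transcription.
text = pd.Series([
    "SUBJECTIVE:,  This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "1.  The left ventricular cavity size and wall ...",
    "CHIEF COMPLAINT: , This 5-year-old male presen...",
    None,
])

# One named group per label; \b restricts matches to whole words.
pattern = (
    r"\b(?:(?P<female>female|woman|lady|girl|she)"
    r"|(?P<male>gentleman|man|male|boy|he))\b"
)

# str.extract returns one row per input row, with one column per named group,
# filled from the first match in each string (NaN where nothing matched).
matches = text.str.extract(pattern, flags=re.IGNORECASE)

# Start from a blank label for every row, then fill in the matched ones.
gender = pd.Series(" ", index=text.index)
gender.loc[matches["female"].notna()] = "female"
gender.loc[matches["male"].notna()] = "male"

print(len(gender) == len(text))  # True
print(gender.tolist())           # ['female', 'male', ' ', 'male', ' ']
```

Since the result is built on the index of the original Series, its length cannot differ from the number of input rows, and missing rows simply keep the blank label.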