Python regex - extracting text between (multiple) expressions in a text file

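I am a Python beginner and would be very grateful if you could help me with a text extraction problem.

I want to extract all text between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to run this analysis on a bunch of files; below is an example of one such text file -> I want to extract all text from "Dear" to "Douglas". If "letter_end" has no match, i.e. no letter_end expression is found, the output should start at letter_begin and end at the very end of the text file being analyzed.

Edit: the end of the "recorded text" must come after the "letter_end" match and before the first line with 20 or more characters (as is the case for "Random text here as well" -> len=24).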

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
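This is my code so far - but it cannot flexibly capture the text between the expressions (anything can come before "letter_begin" and after "letter_end": lines, text, numbers, symbols, etc.).

I would really appreciate your help!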

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
This pattern will produce a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
Note: you should not use re.DOTALL with this pattern, and the re.MULTILINE option is redundant as well.
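The effect of the two flags can be checked directly. The sketch below uses a shortened version of the sample text (an assumption made for brevity) and compares the re.findall results with and without re.DOTALL and re.MULTILINE.

```python
import re

# Shortened stand-in for the sample letter (assumption for brevity).
text = (
    "Some random text here\n"
    "\n"
    "Dear Shareholders We\n"
    "value the trust that you place in us.\n"
    "Best regards \n"
    "Douglas\n"
    "\n"
    "Random text here as well"
)

pattern = r"(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}"

without_flag = re.findall(pattern, text, re.IGNORECASE)
with_multiline = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

# re.MULTILINE only changes ^ and $, and the pattern uses neither.
print(without_flag == with_multiline)
# With re.DOTALL the .* after the closing phrase also crosses newlines,
# so the match runs on to the end of the text.
print(repr(without_flag[0][-10:]))
print(repr(with_dotall[0][-10:]))
```

With re.DOTALL the match no longer stops two lines after the closing phrase, which is why the flag has to go.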

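Details

  • (?:dear|to our|estimated)
    - any one of the three values
  • [\s\S]*?
    - any 0+ characters, as few as possible
  • (?:sincerely|yours|best regards)
    - any one of the three values
  • .*
    - any 0+ characters other than line break characters
  • (?:\n.*){0,2}
    - zero, one or two repetitions of a newline followed by any 0+ characters other than line break characters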

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
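The question also asks that, when no letter_end expression is found, the output run from letter_begin to the very end of the file. The pattern above returns no match in that case. One possible way to cover it, offered as a sketch and not as part of the original answer, is to add \Z (end of string) as an alternative to the closing part. extract_letter is a hypothetical helper name.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# After the opening, lazily consume text until either
#   - a closing phrase, the rest of its line and up to two more lines, or
#   - the end of the text (\Z) when no closing phrase occurs.
regex = r"(?:{})[\s\S]*?(?:(?:{}).*(?:\n.*){{0,2}}|\Z)".format(openings, closings)


def extract_letter(text):
    """Return the first letter found in text, or None if no opening matches."""
    match = re.search(regex, text, re.IGNORECASE)
    return match.group(0) if match else None


# Invented sample texts for illustration.
no_closing = "Header\nDear Shareholders\nWe had a good year.\nMore text"
with_closing = "x\nDear A\nbody\nSincerely\nBob\n\nlong trailing text"

print(repr(extract_letter(no_closing)))
print(repr(extract_letter(with_closing)))
```

Because the middle part is lazy, a closing phrase is still preferred whenever one exists; \Z can only succeed once the scan has reached the end of the text.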

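You say you want to extract all text from "Dear" to "Douglas", but your regex contains no Douglas, and \n\S+ would prevent the regex from matching even if you added it to letter_end. Maybe all you want is regex = r"(?:" + openings + r").*?" + r"(?:" + closings + r")"?

@WiktorStribiżew: Thank you very much for your help - this already looks great! Do you know how to get the next 5 words after the defined "letter_end" (so that I can pick up any name after the closing expression)?

How do you define a "word"? What characters can appear between them? If you match 5 words, you may get more than just Douglas.

OK, I see the problem. Is there a way to tell the regex to take "the next 2 lines" after "letter_end", given that the "other random text" will start at least 3 lines after letter_end? -> r"(?:"+openings+r").+"+r"(?:"+closings+[\Line+\Line+{0,2}r")"?

Remove re.DOTALL and use regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}". You also don't need re.MULTILINE, by the way.

Thanks a lot, Wiktor! I need one last edit to the regex: the output text should stop before the first line after the "letter_end" match that has more than 20 characters. In the example above it would produce the same output, since len("Random text here as well") = 24. (Condition to satisfy at the end of the regex: stop at the first line after the "letter_end" match that contains > 20 characters.)

@DominikScheld Use r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*", but you need to pass the re.M flag with it.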
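The last pattern from the comments can be tried out as follows. This is a minimal sketch that reuses the letter_begin and letter_end lists from the answer; the sample text with its extra signature lines is invented for illustration.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# After the closing line, keep consuming lines of at most 19 characters.
# $ needs re.M so that it matches at the end of every line.
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*".format(openings, closings)

# Invented sample: three short lines after the closing, then a 24-character line.
sample = (
    "Some random text here\n"
    "Dear Shareholders We\n"
    "value the trust that you place in us.\n"
    "Best regards\n"
    "Douglas\n"
    "CEO\n"
    "Fund Inc.\n"
    "Random text here as well\n"
    "short"
)

letters = re.findall(regex, sample, re.IGNORECASE | re.M)
print(letters)

# Without re.M, $ only matches at the very end of the text, so none of the
# short lines after the closing phrase is picked up.
letters_without_m = re.findall(regex, sample, re.IGNORECASE)
print(letters_without_m)
```

Unlike the fixed {0,2} version, this variant takes as many short lines as follow the closing phrase and stops at the first line of 20 or more characters.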
import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))
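The code above joins the phrases with | as they are, which works because none of them contains regex metacharacters. A minimal sketch, assuming some phrases might contain characters such as . or ?, escapes each phrase with re.escape first. build_letter_regex and its extra_lines parameter are hypothetical names introduced here, not part of the original answer.

```python
import re


def build_letter_regex(letter_begin, letter_end, extra_lines=2):
    """Compile the letter pattern from two lists of literal phrases.

    extra_lines is the maximum number of lines kept after the closing line.
    """
    # re.escape makes every phrase match literally, even if it contains
    # characters that are special in a regex.
    openings = "|".join(re.escape(phrase) for phrase in letter_begin)
    closings = "|".join(re.escape(phrase) for phrase in letter_end)
    pattern = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,{}}}".format(
        openings, closings, extra_lines
    )
    return re.compile(pattern, re.IGNORECASE)


letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

# Invented short sample for illustration.
sample = "Dear A\nbody\nBest regards \nDouglas\n\nRandom text here as well"

two_lines = build_letter_regex(letter_begin, letter_end).findall(sample)
no_lines = build_letter_regex(letter_begin, letter_end, extra_lines=0).findall(sample)
print(two_lines)
print(no_lines)
```

Compiling once also helps when the same pattern is applied to many files, since the compiled object can be reused for each of them.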