Python: using regex to find the last pair of occurrences


python regex


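Attached is a file that I want to parse. I want to select the text within the last combination of occurrences of these words:

(1) Item 7 Management's Discussion and Analysis
(2) Item 8 Financial Statements

I usually use a regex like the following: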

re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)

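As you can see in the text file, the combination of Item 7 and Item 8 occurs frequently, but if I find the last match of (1) and the last match of (2), I greatly increase the probability of getting the text I want.

The text I want from my text file begins with:

"This Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations, and other parts of this Form 10-K contain forward-looking statements, within the meaning of the Private Securities Litigation Reform Act of 1995, that involve risks and ..."

and ends with:

"Item 8. Financial Statements and Supplementary Data"

How can I adjust my regex to capture the last pair of Item 7 and Item 8?

Update:

I also tried to parse it using the same items.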

Use this pattern together with the dot-matches-newline option (re.DOTALL in Python):

.*(Item 7.*?Item 8)

The result is in capture group 1.
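A minimal sketch of how this pattern behaves, assuming a made-up sample string (the real filing text is not reproduced here). The leading greedy .* consumes as much as it can, so the capture group ends up starting at the last "Item 7" that is still followed by an "Item 8". The names sample and last_pair are illustrative only.

```python
import re

# Made-up sample: a table-of-contents entry followed by the real section.
sample = (
    "Item 7 Management's Discussion (table of contents)\n"
    "Item 8 Financial Statements (table of contents)\n"
    "Some other text\n"
    "Item 7 Management's Discussion: the real section text\n"
    "Item 8 Financial Statements\n"
    "Trailing text\n"
)

# re.DOTALL lets "." match line breaks. The greedy ".*" backtracks from the
# end of the string, so group 1 starts at the last "Item 7" that has an
# "Item 8" somewhere after it.
match = re.search(r".*(Item 7.*?Item 8)", sample, re.DOTALL)
last_pair = match.group(1) if match else None
print(last_pair)
```

Note that the group stops at the literal text "Item 8", so a cross-reference to Item 8 inside the discussion itself would end the capture early.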

Try this. I added a lookahead.

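This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).

The discussions variable contains the result for each data file.

This was the original solution. It does not work with the new file, but it shows the use of named groups. I am not familiar with StackOverflow protocol here. Should I delete this old code?

By using longer match strings, the number of matches can be reduced to just two each for Item 7 and Item 8: the table of contents and the actual section.

So search for the second occurrence of Item 7 and keep all the text up to Item 8. This code uses Python named groups.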

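import re

with open('Output2.txt') as f:
    doc = f.read()

# \.* matches zero or more periods; \s* matches whitespace, including line breaks.
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"

# First Item 7 heading (table of contents), then a later Item 7 heading,
# then the discussion text up to the last Item 8 heading.
discussion_pattern = re.compile(
    r"(?P<item7>" + item7 + ")"
    r"([\S\s]*)"
    r"(?P<item7heading>" + item7 + ")"
    r"(?P<discussion>[\S\s]*)"
    r"(?P<item8heading>" + item8 + ")"
)

match = discussion_pattern.search(doc)
# search() returns None when the pattern is not found.
discussion = match.group('discussion') if match else None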


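Interesting. It works well. Now I have to understand what re.compile and re.search actually do... but it does capture what I want. Is the \.\s there to capture the period in "Item 7."? Some of my .txt files do not have the period... can I remove it?

\. matches a period. \s matches whitespace (including newlines). Rather than removing the \., I suggest adding a "*" (without the quotes) after it so that it matches zero or more periods. If the \. is removed, the code will not work for files that do have the period. I will edit the code to add the *.

I looked at the new file Output2012.txt. The second file does not have a table of contents like the first one. I will change the code to handle both files.

The code as written now supports both Output2.txt and Output2012.txt. It now uses re.findall() and selects the last match.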
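The exchange above about \. and \s can be checked with a short sketch. The sample strings are made up for illustration, and the shortened heading pattern is an assumption, not the full pattern from the answer. It also touches on the re.compile and re.search question: re.compile builds a reusable pattern object, and its search method scans a string for the first place the pattern matches, returning None if there is none.

```python
import re

# \.* allows zero or more periods; \s* allows any whitespace, including newlines.
heading = re.compile(r"Item 7\.*\s*Management")

samples = [
    "Item 7. Management",    # period and a space
    "Item 7 Management",     # no period
    "Item 7.\nManagement",   # period followed by a line break
]
results = [bool(heading.search(s)) for s in samples]

# Without the *, exactly one period is required, so the sample
# without a period no longer matches.
strict = re.compile(r"Item 7\.\s*Management")
strict_results = [bool(strict.search(s)) for s in samples]

print(results)         # [True, True, True]
print(strict_results)  # [True, False, True]
```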

The regex with the added lookahead:

re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))",text, re.DOTALL)

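The rewritten code, which handles both Output2.txt and Output2012.txt:

import re

# \.* allows zero or more periods after the item number;
# \s* allows any whitespace (including line breaks) before the title.
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
# Lazy quantifier, so each Item 7 heading pairs with the next Item 8 heading.
# A greedy [\S\s]* would give one match from the first Item 7 to the last Item 8.
discussion_text = r"[\S\s]*?"
item8 = r"Item 8\.*\s*Financial Statements"
discussion_pattern = re.compile(item7 + discussion_text + item8)

discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
    with open(input_file_name) as f:
        doc = f.read()

    results = discussion_pattern.findall(doc)

    # Some input files have a table of contents and others don't;
    # just keep the last match (None if nothing matched).
    discussion = results[-1] if results else None

    discussions.append((input_file_name, discussion))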
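Since the approach above relies on re.findall() returning more than one match, here is a small sketch of how a greedy versus a lazy quantifier between the two headings changes the result. The string doc and the shortened heading patterns are made up for illustration and are not taken from the actual filings.

```python
import re

# Made-up document: a table of contents followed by the actual sections.
doc = (
    "Item 7. Discussion\nItem 8. Statements\n"             # table of contents
    "Item 7. Discussion\nreal text\nItem 8. Statements\n"  # actual section
)

# Greedy: a single match running from the first "Item 7." to the last "Item 8."
greedy = re.findall(r"Item 7\.[\S\s]*Item 8\.", doc)

# Lazy: one match per Item 7 / Item 8 pair, so the last element is the real section.
lazy = re.findall(r"Item 7\.[\S\s]*?Item 8\.", doc)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy[-1])
```

With the lazy form, a file without a table of contents simply yields one match, so taking the last element works for both kinds of file.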
