How can I use a regular expression in Python to extract (speaker, text) tuples from call transcripts?

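For my master's thesis, I need to extract (speaker, text) tuples from company earnings call transcripts.

The transcripts have the following format: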

OPERATOR: Some text with numbers, special characters and linebreaks.

NAME, COMPANY, POSITION: Some text with numbers, special characters and linebreaks.

NAME: Some text with numbers, special characters and linebreaks.

I want to extract all (speaker, text) tuples from the document. For example:

[("OPERATOR", "Some text with numbers, special characters and linebreaks."), ..]

So far, I have tried several regular expressions with Python's re.findall function.

Here is an example excerpt:

example = """OPERATOR: Good day, ladies and gentlemen, and welcome to the first-quarter 2012
Agilent Technologies earnings conference call. My name is Keith, and I will be
your operator for today. At this time, all participants are in a listen-only
mode. Later on, we will have a question and answer session. (Operator
Instructions) As a reminder, today's conference is being recorded for replay
purposes.

And I would now like to turn the conference over to your host for today, Ms.
Alicia Rodriguez, Vice President of Investor Relations. Please go ahead, ma'am.

ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and
welcome, everyone, to Agilent's first quarter conference call for fiscal-year
2012. With me are Agilent's President and CEO, Bill Sullivan, as well as Senior
Vice President and CFO, Didier Hirsch. Joining in the Q&A after Didier's
comments will be Agilent's Chief Operating Officer, Ron Nersesian, and the
Presidents of our Electronic Measurement, Life Sciences, and Chemical Analysis
Groups -- Guy Sene, Nick Roelofs, and Mike McMullen.

You can find the press release and information to supplement today's discussion
on our website at www.investor.agilent.com. While there, please click on the
link for financial results, where you will find revenue breakouts and historical
financials for Agilent's operations. We will also post a copy of the prepared
remarks following this call. For any non-GAAP financial measures, you will find
the most directly comparable GAAP financial metrics and reconciliations on our
website.

We will make forward-looking statements about the financial performance of the
Company. These statements are subject to risks and uncertainties, and are only
valid as of today. The Company assumes no obligation to update them. Please look
at the Company's recent SEC filings for a more complete picture of our risks and
other factors.

Before turning the call over to Bill, I would like to remind you that Agilent
will host its annual analysts meeting in New York City on March 8. Details about
the meeting and webcast will be available on the Agilent investor relations
website two weeks prior.

And now, I'd like to turn the call over to Bill.

BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia, and
hello, everyone. Agilent's Q1 orders of $1.62 billion were flat versus last
year. Q1 revenues of $1.64 billion were up 7% year-over-year. Non-GAAP EPS was
$0.69 per share, and operating margin was 19%."""

Here is my code:

import re

# First approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
re.findall(r, example)

# Second approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)
re.findall(r, example)

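The problem with the first (non-greedy) approach is that it does not capture the speaker's full text: [\s\S]+? stops after a single character.

The problem with the second (greedy) approach is that it does not stop when the next speaker appears: [\s\S]+ runs to the end of the document.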
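Both symptoms are easy to reproduce on a shortened excerpt. The sketch below runs the two patterns from the question against a made-up two-speaker string; the sample text and the variable names are illustrative assumptions, not part of the original data.

```python
import re

# Shortened, made-up excerpt in the same layout as the transcripts.
sample = (
    "OPERATOR: Good day, and welcome.\n"
    "My name is Keith.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR: Thank you, Keith.\n"
)

# The two patterns from the question, unchanged.
lazy = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
greedy = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)

lazy_matches = lazy.findall(sample)
greedy_matches = greedy.findall(sample)

# Lazy: nothing follows [\s\S]+?, so it is satisfied by one character.
for speaker, text in lazy_matches:
    print(repr(speaker), repr(text))

# Greedy: [\s\S]+ consumes everything, including the next speaker label.
print(len(greedy_matches))
```

The lazy quantifier has nothing after it that forces it to expand, and the greedy one has nothing that forces it to stop, so the pattern needs an explicit rule for where a speaker's turn ends.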

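Edit: additional information

The text group can also contain colons. In some cases, a colon appears right after the first word of a line, e.g.: ... The speaker group can also span multiple lines, for example when the company name and position description are very long.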

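You can match this without using [\s\S]+, since that matches any character, including newlines.

For the second capture group, you can match .* for the rest of the first line, and then use a repeating group with a negative lookahead that keeps consuming lines as long as the following line does not start with a speaker label, i.e. as long as what follows does not match \n[^a-z\r\n]+: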

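This is not possible unless you define rules that make clear where the text should stop. You don't want us to guess for you.

Try r = re.compile(r"^([^a-z\n:]+):(.*?)(?=\n[^a-z\n:]+:|\Z)", re.MULTILINE | re.DOTALL)

If you were going to use the regex suggested above, you had better not; try this one instead.

@usr2564301 Yes, that is possible. I will expand my question with more information.

@revo Why is your approach better? In terms of performance? That really matters for my analysis, because the dataset is very large.

There are also cases where the speaker group spans two lines, e.g. when the position description or company name is very long. Is there a way to incorporate this case into the regex?

This works well, but there is still one case where it fails. Look at the speaker group of match 51 for the full document. Do you think it can be handled? Perhaps by using the fact that every speaker group is preceded by \n\n?

Would it fix it to not match a ... at the start?

Unfortunately not, because it does not work for similar constellations that cause the same problem. But the fix is easier: I just added another \n to the front of your regex. Now it works :-)

I am not sure whether the OP is really looking for this regex, but from the link they provided I see that "Language: English" at the end is left out.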

^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)
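As a minimal sketch of how this pattern might be used, the snippet below compiles it with re.MULTILINE and runs it on a shortened, made-up excerpt. The helper name extract_turns, the sample string, and the whitespace cleanup are assumptions added for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: group 1 is the speaker label (no lowercase
# letters, no line breaks), group 2 is the rest of that line plus every
# following line that is not itself the start of a new speaker label.
speaker_turn = re.compile(
    r"^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)",
    flags=re.MULTILINE,
)

# Made-up excerpt in the same layout as the transcripts.
sample = (
    "OPERATOR: Good day, and welcome. My name is Keith, and I will be\n"
    "your operator for today.\n"
    "\n"
    "And I would now like to turn the conference over to Ms. Rodriguez.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and\n"
    "welcome, everyone.\n"
    "\n"
    "BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia."
)


def extract_turns(transcript):
    """Return (speaker, text) tuples (hypothetical helper).

    Runs of whitespace, including line and paragraph breaks, are
    collapsed to single spaces; drop that step to keep the layout.
    """
    return [
        (speaker.strip(), " ".join(text.split()))
        for speaker, text in speaker_turn.findall(transcript)
    ]


turns = extract_turns(sample)
for speaker, text in turns:
    print(speaker, "->", text)
```

Because the lookahead is only evaluated at line breaks rather than at every character, this form does less work per match than a lazy dot with a lookahead. One caveat under these assumptions: any line inside a speaker's text that begins with characters other than lowercase letters followed by a colon would still be read as a new speaker label, and speaker labels that span two lines are not handled.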