Python: how to extract sentences containing a citation marker from a text file

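For example, I have the three sentences below, one of which contains a citation marker, (Warren and Pereira, 1982), in the middle. A citation is always enclosed in parentheses in this format: (~string~ comma(,) ~space~ number~)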



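He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.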

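I am using a regular expression to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this: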

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).

You can split the text into a list of sentences and then pick the ones that end with ")".
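A minimal sketch of this split-then-filter idea, assuming sentences are separated by a period followed by whitespace. The helper name `sentences_ending_with_citation` is hypothetical, and the naive splitter will mis-split on abbreviations such as "e.g.". It only finds citations that close the sentence.

```python
import re

text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)


def sentences_ending_with_citation(raw):
    """Return sentences whose last character before the period is ')'."""
    # Naive sentence split: break on whitespace that follows a period.
    sentences = re.split(r"(?<=\.)\s+", raw.strip())
    # Ignore the trailing period, then test for the closing parenthesis.
    return [s for s in sentences if s.rstrip(".").endswith(")")]


print(sentences_ending_with_citation(text))
```

For the sample text this keeps only the BusTUC sentence. A sentence that continues after the closing parenthesis would be skipped by this check.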

Setup: two strings representing the cases of interest:

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, to match the case where the citation is at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

To match the case where the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

Combine the two cases with the regex "|" operator:

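import re

# Raw strings keep the backslashes literal; [A-Za-z] avoids the punctuation that [A-z] also matches.
p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")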

Run:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

In both cases you get the sentence with the citation.
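Because the combined pattern has two groups, `re.findall` returns a tuple per match with one slot left empty. The sketch below flattens those tuples into plain strings. The helper name `citation_sentences` is hypothetical, and the pattern is written here as raw strings with the character class `[A-Za-z]`, which is an assumption about the intended pattern.

```python
import re

# The two alternatives: citation at the sentence end, or followed by more text.
P_MAIN = re.compile(
    r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
)

text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

t2 = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982) fgbhdr was a state of the art natural. "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)


def citation_sentences(raw):
    """Return each matched sentence as a string instead of a 2-tuple."""
    # Exactly one group is non-empty per match, so `or` picks that one.
    return [first or second for first, second in P_MAIN.findall(raw)]


print(citation_sentences(text))
print(citation_sentences(t2))
```

Note that both alternatives start with `\. `, so a citation in the very first sentence of the text has no preceding period and will not be matched.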

The Python regular expression documentation and the accompanying regex HOWTO page are good resources.

Cheers