Python: how to extract sentences that contain a citation marker from a text file

For example, I have the 3 sentences below, and one of them contains a citation marker, (Warren and Pereira, 1982), in the middle. The citation is always enclosed in parentheses in this format: (~string~ comma(,) ~space~ number~)

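He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.

I used a regular expression to extract only the middle sentence, but it prints all 3 sentences. The result should look like this:

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).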

You can split the text into a list of sentences and then pick the ones that end with ")".
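A minimal sketch of that split-and-filter idea, assuming sentences end with ".", "!" or "?" followed by whitespace (a naive rule that is my assumption, not part of the answer). The names `sentences` and `cited` are only illustrative.

```python
import re

text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

# Naive sentence split: break on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep sentences whose last character, ignoring the final punctuation, is ')'.
cited = [s for s in sentences if s.rstrip(".!?").endswith(")")]

for sentence in cited:
    print(sentence)
```

This only finds citations that close a sentence, so a citation in the middle of a sentence would be missed.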


Setup. Two strings that represent the cases of interest:

import re

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."
First, to match when the citation is at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

To match when the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
Combine the two cases with the "|" regex operator:

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")
Run:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

In both cases you get the sentence that contains the citation.
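As a possible alternative to the two-branch pattern, here is a rough sketch that first splits the text into sentences and then keeps any sentence containing a citation, wherever it sits. The pattern `CITATION` and the helper `sentences_with_citation` are hypothetical names of mine; the pattern assumes the format stated in the question (text, a comma, a space, digits, all inside one pair of parentheses), and the sentence split is deliberately naive.

```python
import re

# Assumed citation shape: "(" text "," space digits ")" with no nested parentheses.
CITATION = re.compile(r"\([^()]+, [0-9]+\)")


def sentences_with_citation(text):
    """Return the sentences of `text` that contain a parenthesised citation."""
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CITATION.search(s)]


text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)
t2 = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982) fgbhdr was a state of the art natural. "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

print(sentences_with_citation(text))
print(sentences_with_citation(t2))
```

This returns a flat list of strings rather than tuples with empty slots, at the cost of depending on how well the sentence split works.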

The Python regex documentation and the accompanying regex how-to page are good resources.


Cheers

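Is it always the middle sentence, or is the citation always inside parentheses?

Not always the middle sentence. The most important thing is that the citation is always enclosed in parentheses in this format: (~string~ comma(,) ~space~ number~)

Thx, so I don't always have to use a regex. But what if the citation is not at the end of the sentence?

What if the surrounding sentences contain a string like "Mr. John" (with a dot)? Then we can't split each sentence on ".".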
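One way to approach the "Mr. John" concern, sketched under assumptions: refuse to split after a short list of known abbreviations. The list `ABBREVIATIONS`, the pattern `SENTENCE_BREAK`, and the sample string below are all made up for illustration, and a dedicated sentence tokenizer would likely be more robust than this.

```python
import re

# Hypothetical abbreviation list; extend it for your own data.
ABBREVIATIONS = ("Mr", "Mrs", "Ms", "Dr", "Prof")

# One fixed-width negative lookbehind per abbreviation, e.g. (?<!Mr\.)
lookbehinds = "".join(rf"(?<!{re.escape(a)}\.)" for a in ABBREVIATIONS)

# Split on whitespace after '.', '!' or '?', unless an abbreviation precedes it.
SENTENCE_BREAK = re.compile(lookbehinds + r"(?<=[.!?])\s+")

# Assumed citation shape: "(" text "," space digits ")".
CITATION = re.compile(r"\([^()]+, [0-9]+\)")

# Made-up sample text that contains an abbreviation with a dot.
sample = (
    "Mr. John praised CHAT-80 (Warren and Pereira, 1982) yesterday. "
    "He left at 6."
)

sentences = SENTENCE_BREAK.split(sample)
cited = [s for s in sentences if CITATION.search(s)]
print(cited)
```

The lookbehinds do not check word boundaries, so a word that merely ends in one of the listed abbreviations would also block a split.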