Python: Sentence tokenization for text that contains quotes


Code:

from pprint import pprint
from nltk.tokenize import sent_tokenize
from unidecode import unidecode

pprint(sent_tokenize(unidecode(text)))

Output:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.']

Input:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The quote should be included in the previous sentence, instead of opening the next one as "Li.

It fails where the quotation ends, at the full stop followed by the closing quote. How can I fix this?

Edit: explaining how the text is extracted

html = open(path, "r").read()                           # reads the HTML source
article = extractor.extract(raw_html=html)              # extracts the content
text = unidecode(article.cleaned_text)                  # converts Unicode text to ASCII

Here, article.cleaned_text is Unicode. The idea behind using unidecode is to change characters such as “ into ".

Incorrect result with the solution from @alvas:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"',
 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]

Edit 2 (updated): nltk and Python versions

python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9

I am not sure what the desired output is, but I think you may need some paragraph segmentation before nltk.sent_tokenize, i.e.:

>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
... 
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...             print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
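
The paragraph-first idea can be wrapped in a small helper. This is only a sketch in Python 3 syntax (the transcripts in this thread are Python 2.7). The names split_paragraphs, tokenize_by_paragraph and naive_sent_tokenize are made up for illustration, and naive_sent_tokenize is a crude regex stand-in so the example runs without NLTK or its Punkt data. In practice you would pass nltk.sent_tokenize as the tokenizer argument.

```python
import re

TEXT = (
    'After Du died of suffocation, her boyfriend posted a heartbreaking '
    'message online: "Losing consciousness in my arms, your breath and '
    'heartbeat became weaker and weaker. Finally they pushed you out of '
    'the cold emergency room. I failed to protect you."\n\n'
    'Li Na, 23, a migrant worker from a farming family in Jiangxi '
    'province, was looking forward to getting married in 2015.'
)


def naive_sent_tokenize(paragraph):
    """Hypothetical stand-in for nltk.sent_tokenize.

    Splits on whitespace that follows ., ! or ?, optionally followed by a
    closing double quote. Much cruder than Punkt (no abbreviation handling).
    """
    pieces = re.split(r'(?<=[.!?])\s+|(?<=[.!?]")\s+', paragraph.strip())
    return [piece for piece in pieces if piece]


def split_paragraphs(text):
    """Split on blank lines, so a sentence can never span two paragraphs."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]


def tokenize_by_paragraph(text, tokenizer=naive_sent_tokenize):
    """Tokenize each paragraph separately and return one flat list."""
    sentences = []
    for paragraph in split_paragraphs(text):
        sentences.extend(tokenizer(paragraph))
    return sentences


sentences = tokenize_by_paragraph(TEXT)
for sentence in sentences:
    print(sentence)
```

Because the closing quote is the last character of its paragraph, it can no longer be glued onto the start of the next paragraph's first sentence.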
Or, if what you want is the strings inside the double quotes, you could try this:

>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']
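
The question runs the text through unidecode so that curly quotes become straight ones before matching. As an alternative sketch, the pattern can accept both styles directly. The names QUOTED and extract_quotes are hypothetical, and the pattern assumes quotes are balanced and not nested.

```python
import re

# First alternative: straight double quotes. Second: curly “ ... ” quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def extract_quotes(text):
    """Return the contents of every quoted span, without the quote marks."""
    # findall yields (straight, curly) tuples; exactly one group is filled
    # for a non-empty quote, so "or" picks whichever matched.
    return [straight or curly for straight, curly in QUOTED.findall(text)]


sample = 'He wrote: "A. B." Later she replied: “C.”'
quotes = extract_quotes(sample)
print(quotes)
```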
When reading from a file, try using the io package:

alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

>>> for sent in sent_tokenize(text):
...     print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
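
The same file-reading step can be sketched as a small function that closes the file and returns paragraphs directly. This is Python 3 syntax, read_paragraphs is a hypothetical helper name, and the temporary file only exists to make the example self-contained.

```python
import io
import os
import re
import tempfile


def read_paragraphs(path, encoding='utf8'):
    """Read a text file as Unicode and split it into paragraphs on blank lines."""
    with io.open(path, 'r', encoding=encoding) as handle:
        text = handle.read()
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]


# Demo: write a small two-paragraph file, read it back, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write('First paragraph, line one.\nStill the first paragraph.\n\n'
              'Second paragraph.\n')
paragraphs = read_paragraphs(tmp.name)
os.remove(tmp.name)
print(paragraphs)
```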
For the magic that joins the pre-quote sentence with its quote (don't blink: the only change from the plain paragraph and quote extraction loop is the trailing comma after print sent):

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent,
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online:  "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The problem with the above code is that it is limited to sentences like:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

And it cannot handle:

"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.
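
One possible way to cover both quote-last and quote-first sentences is to hide each quote behind a punctuation-free placeholder, tokenize, and then put the quotes back. The sketch below is in Python 3 syntax and is not part of the answer above. The placeholder format, mask_quotes, tokenize_keeping_quotes and naive_sent_tokenize are all made up for illustration, and naive_sent_tokenize is a crude stand-in for nltk.sent_tokenize so the example runs without NLTK.

```python
import re

QUOTE = re.compile(r'"[^"]*"')
MARKER = re.compile(r'QUOTEMARKER(\d+)END')  # hypothetical placeholder format


def naive_sent_tokenize(paragraph):
    """Hypothetical stand-in for nltk.sent_tokenize (splits after . ! ?)."""
    pieces = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [piece for piece in pieces if piece]


def mask_quotes(paragraph):
    """Replace every quoted span with a numbered placeholder."""
    quotes = []

    def stash(match):
        quotes.append(match.group(0))
        return 'QUOTEMARKER{}END'.format(len(quotes) - 1)

    return QUOTE.sub(stash, paragraph), quotes


def tokenize_keeping_quotes(text, tokenizer=naive_sent_tokenize):
    """Tokenize paragraph by paragraph without splitting inside quotes."""
    sentences = []
    for paragraph in re.split(r'\n\s*\n', text):
        masked, quotes = mask_quotes(paragraph)
        for sent in tokenizer(masked):
            # Swap each placeholder back for the quote it stands for.
            restored = MARKER.sub(lambda m: quotes[int(m.group(1))], sent)
            sentences.append(restored)
    return sentences


quote_last = ('After Du died of suffocation, her boyfriend posted a '
              'heartbreaking message online: "Losing consciousness in my '
              'arms. I failed to protect you."\n\nLi Na, 23, was looking '
              'forward to getting married in 2015.')
quote_first = ('"Losing consciousness in my arms. I failed to protect you," '
               'her boyfriend posted after Du died of suffocation.')

last_result = tokenize_keeping_quotes(quote_last)
first_result = tokenize_keeping_quotes(quote_first)
print(last_result)
print(first_result)
```

A known weakness of this sketch: the placeholder carries no sentence-final punctuation, so a sentence that starts right after a quote in the same paragraph is merged with it. It also assumes the placeholder string never occurs in the real text.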

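Beyond the computational side of text processing, the grammar of the text in the question is also subtly different.

Introducing a quotation with a colon (:) is not typical of traditional English grammar. It may have become common in Chinese news because in Chinese one writes:

啊杜窒息死亡后,男友在网上发了令人心碎的消息: "..."

In traditional English, in a very prescriptive grammatical sense, it would have been:

After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."

And a statement that follows the quotation would be signalled by a closing comma instead of a full stop, e.g.:

"...," her boyfriend posted a heartbreaking message online after Du died of suffocation.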


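Comments:

What should the output look like?

Natural language parsing is hard. You can try looking into why the tokenizer does this. It looks like the default sentence tokenizer does not recognise the quotation mark as punctuation. You can specify it as a possible sentence-boundary character by creating your own object.

@Raniz The quote should be included in the previous sentence, instead of "Li.

@augurar I am not sure that is a good solution, because I am doing this on many documents. Besides, there are things like "It is ......." the company said.

Thanks for the amazing answer! Please check my edit. There still seems to be a problem.

Thanks for the prompt reply! But I would like to keep the quotes and use a regular expression later, as you said!

@AbhishekBhatia, is the final desired output as shown in the current answer?

Amazing! But there is still one issue, please see Edit 2 above. One more thing I forgot: this is a single sentence:
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
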
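The plain paragraph and quote extraction trick, without the trailing comma after print sent, prints each sentence and each quote on its own line:

>>> import io, re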
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

To make sure, my Python and NLTK versions are:

$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6
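
Coming back to the two outputs in the question, where the closing quote either opens the next sentence or ends up as a sentence of its own: another option is to leave the tokenizer alone and repair its output afterwards. The sketch below is in Python 3 syntax, reattach_closing_quotes is a hypothetical helper, and it only handles straight double quotes (for example after the unidecode step).

```python
def reattach_closing_quotes(sentences, quote='"'):
    """Move a leading closing quote back onto the previous sentence.

    Tracks whether an opening quote is still unclosed. Only then is a
    quote at the start of a sentence treated as a closing quote.
    """
    fixed = []
    inside_quote = False
    for sent in sentences:
        if inside_quote and fixed and sent.startswith(quote):
            fixed[-1] += quote                  # close the previous sentence
            sent = sent[len(quote):].lstrip()   # drop it from this one
            inside_quote = False
            if not sent:                        # the sentence was just '"'
                continue
        if sent.count(quote) % 2 == 1:          # odd count flips the state
            inside_quote = not inside_quote
        fixed.append(sent)
    return fixed


glued = ['He posted online: "Losing consciousness in my arms.',
         'I failed to protect you.',
         '"Li Na, 23, was looking forward to getting married in 2015.']
lonely = ['He posted online: "Losing consciousness in my arms.',
          'I failed to protect you.',
          '"',
          'Li Na, 23, was looking forward to getting married in 2015.']

print(reattach_closing_quotes(glued))
print(reattach_closing_quotes(lonely))
```

Both inputs should come out as the same three sentences, with the closing quote attached to I failed to protect you. This keeps the sentences inside the quote separate, which may or may not be what you want.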