Python 2.7: bad sentence tokenization in spaCy (?)


Why does spaCy's sentence splitter/tokenizer work so poorly? nltk seems to handle this fine. Here is my small experiment:

import spacy
nlp = spacy.load('fr')
import nltk

text_fr = u"Je suis parti a la boulangerie. J'ai achete trois croissants. C'etait super bon."


nltk.sent_tokenize(text_fr)
# [u'Je suis parti a la boulangerie.',
# u"J'ai achete trois croissants.",
# u"C'etait super bon."


doc = nlp(text_fr)
for s in doc.sents: print s
# Je suis parti
# a la boulangerie. J'ai
# achete trois croissants. C'
# etait super bon.
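As an aside, nltk.sent_tokenize defaults to the English Punkt model; NLTK 3.x also accepts a language argument, so the French model can be requested explicitly. This is only a sketch and assumes the punkt data has already been downloaded:

import nltk
# nltk.download('punkt')  # one-time download of the Punkt models, if needed

text_fr = u"Je suis parti a la boulangerie. J'ai achete trois croissants. C'etait super bon."
# Loads tokenizers/punkt/french.pickle instead of the English default
nltk.sent_tokenize(text_fr, language='french')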
I noticed the same behaviour with English. For this text:

text = u"I went to the library. I did not know what book to buy, but then the lady working there helped me. It was cool. I discovered a lot of new things."
with spaCy (after nlp = spacy.load('en')) the segmentation comes out fragmented in the same way, whereas nltk looks fine:

[u'I went to the library.',
 u'I did not know what book to buy, but then the lady working there helped me.',
 u'It was cool.',
 u'I discovered a lot of new things.']

I did not know what to do about it at the time, but it turned out I was using an old version of spaCy (v0.100). After installing the latest spaCy (v2.0.4), sentence segmentation is much more coherent.
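For anyone hitting the same issue, it is worth confirming which version is actually being imported before digging further. A minimal check, assuming setuptools is available (newer spaCy releases also expose spacy.__version__ directly):

import pkg_resources

# Report the installed spaCy version, e.g. "0.100" or "2.0.4"
print(pkg_resources.get_distribution('spacy').version)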

With v2.0.4 the same loop now gives:

[u'I went to the library.',
 u'I did not know what book to buy, but then the lady working there helped me.',
 u'It was cool.',
 u'I discovered a lot of new things.']

From the comments: "At the moment, sentence segmentation is based on the dependency parse, which does not always produce ideal results." So my spaCy version (0.100) was simply too old; with v2, spaCy works as expected. Yes, please update your spaCy version, and keep in mind that you can also define a custom sentence splitter (a minimal sketch follows below). Thanks for the update! You may want to mark this as the accepted answer (see).
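For completeness, here is a minimal sketch of the custom sentence splitter mentioned in the comments, assuming spaCy v2.x: a small pipeline component that marks sentence starts via the writable is_sent_start attribute and runs before the parser. The semicolon rule is only an illustrative example, not a recommendation.

import spacy

nlp = spacy.load('fr')

def custom_sentence_boundaries(doc):
    # Illustrative rule: start a new sentence after every semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# The component must run before the parser so the parser respects these boundaries.
nlp.add_pipe(custom_sentence_boundaries, before='parser')

doc = nlp(u"Je suis parti a la boulangerie; j'ai achete trois croissants.")
for s in doc.sents:
    print(s)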