Python NLTK sentence tokenizer: treating newlines as sentence boundaries


I am using NLTK's
PunktSentenceTokenizer
to tokenize text into a list of sentences. However, the tokenizer does not seem to treat a new paragraph or a newline as the start of a new sentence.
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
['Sentence 1 \n Sentence 2.', 'Sentence 3.']
>>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
[(0, 24), (25, 36)]

I would like it to treat newlines as sentence boundaries. Is there any way to do this? (I also need to preserve the offsets.)
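One way to get both behaviors is to run span_tokenize on each newline-separated chunk and shift the resulting spans back into the coordinate system of the original text. A sketch (the helper name span_tokenize_with_newlines is my own, not an NLTK API):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def span_tokenize_with_newlines(text):
    """Yield (start, end) sentence spans, treating '\\n' as a hard boundary.

    Punkt is applied to each newline-separated chunk separately, and the
    per-chunk spans are offset so they index into the original string.
    """
    tokenizer = PunktSentenceTokenizer()
    offset = 0
    for chunk in text.split('\n'):
        if chunk:  # skip empty chunks produced by consecutive newlines
            for start, end in tokenizer.span_tokenize(chunk):
                yield (offset + start, offset + end)
        offset += len(chunk) + 1  # +1 accounts for the '\n' we split on

text = 'Sentence 1 \n Sentence 2. Sentence 3.'
spans = list(span_tokenize_with_newlines(text))
sentences = [text[start:end] for start, end in spans]
```

With the example text above this yields three spans instead of two, and each span still slices the original string correctly.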

Well, I ran into the same problem, and what I did was split the text on '\n'. Something like this:

# in my case, when it had '\n', I called it a new paragraph, 
# like a collection of sentences
paragraphs = [p for p in text.split('\n') if p]
# and here, sent_tokenize each one of the paragraphs
for paragraph in paragraphs:
    sentences = tokenizer.tokenize(paragraph)
This is a simplified version of what I have in production, but the general idea is the same. Also, apologies for the Portuguese comments and docstrings; it was made for "educational purposes" for a Brazilian audience.

def paragraphs(self):
    if self._paragraphs is not None:
        for p in self._paragraphs:
            yield p
    else:
        raw_paras = self.raw_text.split(self.paragraph_delimiter)
        gen = (Paragraph(self, p) for p in raw_paras if p)
        self._paragraphs = []
        for p in gen:
            self._paragraphs.append(p)
            yield p

Full code
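The method above caches paragraphs the first time the generator is consumed and re-yields the cached list on later calls. A minimal self-contained sketch of a class it could live in (the fields raw_text and paragraph_delimiter are assumed from the snippet, and the Paragraph wrapper is reduced to a plain string here):

```python
class Document:
    """Hypothetical container illustrating the cached paragraphs() generator."""

    def __init__(self, raw_text, paragraph_delimiter='\n'):
        self.raw_text = raw_text
        self.paragraph_delimiter = paragraph_delimiter
        self._paragraphs = None  # filled lazily on first iteration

    def paragraphs(self):
        if self._paragraphs is not None:
            # already computed: re-yield the cached paragraphs
            yield from self._paragraphs
        else:
            raw_paras = self.raw_text.split(self.paragraph_delimiter)
            self._paragraphs = []
            for p in (p for p in raw_paras if p):
                self._paragraphs.append(p)
                yield p

doc = Document('First paragraph.\nSecond paragraph.')
paras = list(doc.paragraphs())   # ['First paragraph.', 'Second paragraph.']
paras_again = list(doc.paragraphs())  # served from the cache
```

Note that the cache is only complete if the first iteration runs to the end; abandoning the generator early leaves `_paragraphs` partially filled.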

Nice workaround. But it doesn't work for my case, because I also want to preserve the split-point offsets into the original text via tokenizer.span_tokenize(). Although I suppose replacing newlines with periods could work. What does the 'if p' do in the list comprehension [p for p in text.split('\n') if p]?
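To answer the last question: `str.split('\n')` produces an empty string for every run of consecutive newlines and for a trailing newline, and `if p` filters those out, since an empty string is falsy in Python:

```python
text = 'Para 1\n\nPara 2\n'

# Plain split keeps the empty strings around blank lines:
text.split('\n')                     # ['Para 1', '', 'Para 2', '']

# 'if p' drops falsy values, i.e. the empty strings:
[p for p in text.split('\n') if p]   # ['Para 1', 'Para 2']
```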