A simple tokenization problem in Python NLTK


I want to tokenize the following text:

In Düsseldorf I took my hat off. But I can't put it back on.

so that the result is:
'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 
'can't', 'put', 'it', 'back', 'on', '.'

But to my surprise, none of them does this. How can I do it? Is it possible to use some combination of these tokenizers to achieve the above?
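
For reference, a quick sanity check of what two of NLTK's stock tokenizers return for this sentence (a sketch, assuming NLTK and its punkt data are installed); both split "can't" apart, just in different ways:

from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "In Düsseldorf I took my hat off. But I can't put it back on."

# The default Treebank-style tokenizer splits "can't" into "ca" + "n't".
print(word_tokenize(text))

# The punctuation-based tokenizer splits it into "can" + "'" + "t" instead.
print(wordpunct_tokenize(text))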

You could use one of those tokenizers as a starting point and then fix the contractions (assuming that is the problem):

['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', 'But', 'I', 'cant', 'put', 'it', 'back', 'on', ...]
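
One concrete way to do that "fix the contractions afterwards" step is NLTK's MWETokenizer, which re-merges given token sequences into single tokens. A minimal sketch, assuming the contraction pairs listed below (they are examples only, not an exhaustive list):

from itertools import chain

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import MWETokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."

# Contraction pieces that word_tokenize produces and that we want glued back
# together. Illustrative list only; extend it for your own data.
merger = MWETokenizer([('ca', "n't"), ('wo', "n't"), ('do', "n't")], separator='')

tokens = list(chain.from_iterable(word_tokenize(s) for s in sent_tokenize(text)))
print(merger.tokenize(tokens))
# ['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.',
#  'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']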

You should tokenize into sentences before tokenizing words:

>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
If you want to put them back into a single flat list:

>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
>>> list(chain(*text))
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
If you must have ["ca", "n't"] -> ["can't"], the snippet below joins the contraction tokens back together after tokenizing.

(Comments on this answer: "Thanks! Can you tell me the reason for doing it this way? Your code doesn't seem to solve the problem at hand; check 'can't'." / "Great, thanks! Besides the examples you mention in contractions, do you think there are other cases like this?")

>>> from itertools import izip_longest, chain
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
>>> contractions = ["n't", "'ll", "'re", "'s"]

# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove all contraction tokens, since you've joined them to their root stems.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
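
Note that izip_longest only exists in Python 2; on Python 3 the same trick works with itertools.zip_longest. A minimal Python 3 sketch of the same idea:

from itertools import chain, zip_longest

from nltk import sent_tokenize, word_tokenize

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tok_text = list(chain.from_iterable(word_tokenize(s) for s in sent_tokenize(text)))

contractions = ["n't", "'ll", "'re", "'s"]

# Look at each token together with the one that follows it; if the next token
# is a contraction suffix, glue it onto the current token.
paired = [w1 + w2 if w2 in contractions else w1
          for w1, w2 in zip_longest(tok_text, tok_text[1:], fillvalue='')]

# Drop the now-redundant standalone contraction tokens.
print([w for w in paired if w not in contractions])
# ['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.',
#  'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']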