Python NLTK word tokenizer treats the ending single quote as a separate word
Here is a snippet of code from an IPython notebook:
from nltk.tokenize import word_tokenize

test = "'v'"
words = word_tokenize(test)
words
The output is:
["'v", "'"]
As you can see, the ending single quote is treated as a separate word, while the starting quote is kept as part of "v". I want the output to be
["'v'"]
or
["'", "v", "'"]
Is there any way to achieve this?

Try the MosesTokenizer and MosesDetokenizer from nltk.tokenize.moses:
from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
t, d = MosesTokenizer(), MosesDetokenizer()
tokens = t.tokenize(test)
tokens
['&apos;v&apos;']
where &apos; is the XML-escaped form of '.
You can also use the escape=False argument to prevent the escaping of XML special characters:
>>> t.tokenize("'v'", escape=False)
["'v'"]
Keeping 'v' intact is consistent with the output of the original Moses tokenizer, i.e.

~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en
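As an aside, the MosesDetokenizer created above can reverse the process. A minimal round-trip sketch, assuming the nltk.tokenize.moses API from NLTK 3.2.x (the module was removed in NLTK 3.3 in favor of the external sacremoses package, so the return_str argument here is an assumption tied to that older version):

>>> tokens = t.tokenize("'v'", escape=False)  # tokenize without XML escaping
>>> d.detokenize(tokens, return_str=True)     # stitch the tokens back into a string
"'v'"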
There are other options as well if you want to explore and handle the leading single quote.

It seems this is not a bug but the expected output of nltk.word_tokenize().
This is consistent with Robert MacIntyre's Treebank word tokenizer.
As @Prateek pointed out, you can try other tokenizers that may suit your needs.
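For instance, NLTK's RegexpTokenizer can be configured to split every punctuation character off, which produces the fully split form from the question. (An illustrative sketch; the pattern is mine, not necessarily what @Prateek suggested.)

>>> from nltk.tokenize import RegexpTokenizer
>>> # \w+ keeps runs of word characters together; \S splits off each
>>> # remaining non-space character, including both single quotes.
>>> RegexpTokenizer(r"\w+|\S").tokenize("'v'")
["'", 'v', "'"]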
The more interesting question is why the starting single quote sticks to the following character in the first place.

Couldn't we hack the Treebank word tokenizer, the way other quote splits have been patched into it before?
Yes, such a modification would work for the string in the OP, but it would start to break all the clitics. Note that the original nltk.word_tokenize() keeps the starting single quote attached to the clitics and outputs:
>>> print(nltk.word_tokenize("'v', I've been fooled but I'll seek revenge."))
["'v", "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
There are mechanisms for handling the ending quotes, but not the starting ones. The main reason for this "problem", though, is that the word tokenizer has no notion of balancing quotation marks; the Moses tokenizer, by comparison, has many more mechanisms for handling quotes.
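This imbalance is easy to see in the Treebank tokenizer itself: its rules live in plain ordered lists of (compiled regex, replacement) pairs, the same attributes the patch below appends to. A quick inspection:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tb = TreebankWordTokenizer()
>>> for regexp, substitution in tb.ENDING_QUOTES:
...     print(regexp.pattern, '->', substitution)

The ENDING_QUOTES table includes patterns that peel a trailing ' off clitic-like tokens, while STARTING_QUOTES only deals with double quotes and backticks; no rule splits a bare leading '.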
Interestingly, Stanford CoreNLP doesn't do that.

In a terminal:
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
Python:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> parser = CoreNLPParser()
>>> parser.tokenize("'v'")
<generator object GenericCoreNLPParser.tokenize at 0x1148f9af0>
>>> list(parser.tokenize("'v'"))
["'", 'v', "'"]
>>> list(parser.tokenize("I've"))
['I', "'", 've']
>>> list(parser.tokenize("I've'"))
['I', "'ve", "'"]
>>> list(parser.tokenize("I'lk'"))
['I', "'", 'lk', "'"]
>>> list(parser.tokenize("I'lk"))
['I', "'", 'lk']
>>> list(parser.tokenize("I'll"))
['I', "'", 'll']
A regex could be added to patch word_tokenize, e.g.:
>>> import re
>>> pattern = re.compile(r"(?i)(\')(?!ve|ll|t)(\w)\b")
>>> x = "I 'll be going home I 've the 'v ' isn't want I want to split but I want to catch tokens like 'v and 'w ' ."
>>> pattern.sub(r'\1 \2', x)
"I 'll be going home I 've the ' v ' isn't want I want to split but I want to catch tokens like ' v and ' w ' ."
So we can do something like this:
import re
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on https://github.com/nltk/nltk/pull/1437
# Adding to TreebankWordTokenizer, the splits on
# - chevron quotes u'\xab' and u'\xbb'
# - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))
def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the text as one sentence rather than sentence-tokenizing it.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
[out]:

>>> word_tokenize("'v'")
["'", 'v', "'"]
>>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
>>> word_tokenize("'v' 're'")
["'", 'v', "'", "'re", "'"]
From the comments: "That's weird, it should be a bug." "Seems like it's not a bug either =( ; an issue has been raised." "This looks like a bug too! Just confirmed that the original Moses tokenizer does it as well." And indeed, CoreNLP shows the same inconsistency with other clitic-like strings:
>>> list(parser.tokenize("'re"))
["'", 're']
>>> list(parser.tokenize("you're"))
['you', "'", 're']
>>> list(parser.tokenize("you're'"))
['you', "'re", "'"]
>>> list(parser.tokenize("you 're'"))
['you', "'re", "'"]
>>> list(parser.tokenize("you the 're'"))
['you', 'the', "'re", "'"]