Python NLTK regular expression tokenizer
I tried to implement a regular expression tokenizer with NLTK in Python, but the result is:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
But the wanted result is:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why?? Where is the mistake?

You should turn all of the capturing groups into non-capturing ones:
`([A-Z]\.)+` -> `(?:[A-Z]\.)+`
`\w+(-\w+)*` -> `\w+(?:-\w+)*`
`\$?\d+(\.\d+)?%?` -> `\$?\d+(?:\.\d+)?%?`
`regexp_tokenize` appears to use `re.findall`, which returns a list of capture tuples when multiple capturing groups are defined in the pattern. See the `RegexpTokenizer` documentation:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
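The difference is easy to see with plain `re.findall` (a minimal sketch using the abbreviation alternative from the question's pattern):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With a capturing group, re.findall returns what the GROUP captured --
# for a repeated group, only its LAST repetition:
print(re.findall(r'([A-Z]\.)+', text))    # ['A.']

# With a non-capturing group, re.findall returns the whole match:
print(re.findall(r'(?:[A-Z]\.)+', text))  # ['U.S.A.']
```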
Also, I am not sure you meant to use `:-_`: inside a character class it is a range that matches all the uppercase letters (among other characters). To match a literal hyphen, put the `-` at the end of the character class.
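A quick illustration of that range pitfall (hypothetical inputs, chosen only to show what the class matches):

```python
import re

# ':-_' inside a character class is a RANGE from ':' (0x3A) to '_' (0x5F),
# which covers all the uppercase letters A-Z among other characters:
print(re.findall(r'[:-_]', 'A;Z'))  # ['A', ';', 'Z'] -- all fall inside the range

# With '-' moved to the end of the class it is a literal hyphen:
print(re.findall(r'[:_-]', 'A-Z'))  # ['-']
```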
So, use:
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
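Since the corrected pattern no longer contains capturing groups, `re.findall` now returns whole matches, which is exactly what `regexp_tokenize` yields. The fix can therefore be checked with the standard library alone:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```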
Try `from nltk.tokenize import RegexpTokenizer`, `tokenizer = RegexpTokenizer(pattern)`, and then `tokenizer.tokenize(text)` returns the wanted result in my notebook. Maybe a version problem? (3.0.4)

I tried it with Python 3.5, but the result is still the list of capture tuples.

Aha, you should turn all the capturing groups into non-capturing ones: `([A-Z]\.)+` -> `(?:[A-Z]\.)+`, `\w+(-\w+)*` -> `\w+(?:-\w+)*` and `\$?\d+(\.\d+)?%?` -> `\$?\d+(?:\.\d+)?%?`. See above.