Python nltk regexp tokenizer


I am trying to implement a regexp tokenizer with nltk in Python, but the result is this:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the wanted result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

You should make all the capturing groups non-capturing:

  • ([A-Z]\.)+
    ->
    (?:[A-Z]\.)+
  • \w+(-\w+)*
    ->
    \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%?
    ->
    \$?\d+(?:\.\d+)?%?
The problem is that regexp_tokenize seems to use re.findall, which returns a list of capture-group tuples when more than one capturing group is defined in the pattern. See the documentation for the pattern argument:

pattern (str)
– The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
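
As a quick illustration of that behavior (this snippet is mine, not part of the original post), plain re.findall stops returning full matches and returns what the groups captured as soon as the pattern contains capturing groups:

>>> import re
>>> re.findall(r'([A-Z]\.)+|\w+(-\w+)*', 'U.S.A. poster-print')
[('A.', ''), ('', '-print')]
>>> re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', 'U.S.A. poster-print')
['U.S.A.', 'poster-print']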

Also, I am not sure you meant to use :-_, which matches a range that includes all the uppercase letters; put the - at the end of the character class instead.
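
A small check of that range behavior (again just an illustrative snippet with made-up input):

>>> import re
>>> re.findall(r'[:-_]', 'a-b:C_d')   # ':-_' is the range ':' .. '_', which spans 'A'-'Z'
[':', 'C', '_']
>>> re.findall(r'[:_-]', 'a-b:C_d')   # with '-' at the end it is a literal hyphen
['-', ':', '_']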

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
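
Assuming text is defined and nltk is imported as in the question, this non-capturing pattern should give the wanted list:

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']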

Tried from nltk.tokenize import RegexpTokenizer, tokenizer = RegexpTokenizer(pattern), and then tokenizer.tokenize(text) returns the expected list in my notebook. Maybe a version issue? (3.0.4)

I tried it with Python 3.5, but the result is the same list of empty capture tuples as above.

Aha, then you should turn all the capturing groups into non-capturing ones: ([A-Z]\.)+ -> (?:[A-Z]\.)+, \w+(-\w+)* -> \w+(?:-\w+)*, \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?. See the pattern above.
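
For completeness, a minimal sketch of the RegexpTokenizer route mentioned in the comments; with the non-capturing pattern defined above it should behave the same way as nltk.regexp_tokenize:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(pattern)
>>> tokenizer.tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']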