Python nltk regexp tokenizer


I am trying to implement a regexp tokenizer with nltk in Python, but the result is this:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the wanted result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

You should make all the capturing groups non-capturing:

  • ([A-Z]\.)+
    ->
    (?:[A-Z]\.)+
  • \w+(-\w+)*
    ->
    \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%?
    ->
    \$?\d+(?:\.\d+)?%?
The problem is that regexp_tokenize seems to use re.findall, which returns a list of capture-group tuples when more than one capturing group is defined in the pattern. See the documentation for the pattern argument:

pattern (str)
– The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
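
As a quick illustration of that behavior (this snippet is mine, not part of the original post), plain re.findall stops returning full matches and returns what the groups captured as soon as the pattern contains capturing groups:

>>> import re
>>> re.findall(r'([A-Z]\.)+|\w+(-\w+)*', 'U.S.A. poster-print')
[('A.', ''), ('', '-print')]
>>> re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', 'U.S.A. poster-print')
['U.S.A.', 'poster-print']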

Also, I am not sure you meant to use :-_, which matches a range that includes all the uppercase letters; put the - at the end of the character class instead.
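
A small check of that range behavior (again just an illustrative snippet with made-up input):

>>> import re
>>> re.findall(r'[:-_]', 'a-b:C_d')   # ':-_' is the range ':' .. '_', which spans 'A'-'Z'
[':', 'C', '_']
>>> re.findall(r'[:_-]', 'a-b:C_d')   # with '-' at the end it is a literal hyphen
['-', ':', '_']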

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
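
Assuming text is defined and nltk is imported as in the question, this non-capturing pattern should give the wanted list:

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']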

Tried from nltk.tokenize import RegexpTokenizer, tokenizer = RegexpTokenizer(pattern), and then tokenizer.tokenize(text) returns the expected list in my notebook. Maybe a version issue? (3.0.4)

I tried it with Python 3.5, but the result is the same list of empty capture tuples as above.

Aha, then you should turn all the capturing groups into non-capturing ones: ([A-Z]\.)+ -> (?:[A-Z]\.)+, \w+(-\w+)* -> \w+(?:-\w+)*, \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?. See the pattern above.
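
For completeness, a minimal sketch of the RegexpTokenizer route mentioned in the comments; with the non-capturing pattern defined above it should behave the same way as nltk.regexp_tokenize:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(pattern)
>>> tokenizer.tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']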