Regular expressions and Unicode in Python 2.x vs 3.x


python regex python-2.7 unicode


I have a simple function for tokenizing words:

import re
def tokenize(string):
    return re.split("(\W+)(?<!')",string,re.UNICODE)

In Python 3.5.0, I get the following:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']

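The problem is that "é" should not be treated as a character to split on. I thought re.UNICODE would be enough to make \W behave the way I intend.

How can I get the same behavior in Python 2.x as in Python 3.x?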

You probably want to use Unicode strings, but note also that the third parameter of split is not flags but maxsplit:

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
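The effect of passing the flag in the third position can be sketched as follows. This is a minimal illustration, assuming CPython, where re.UNICODE has the integer value 32; the sample string is made up for the demonstration. The flag value is treated as a split limit and no regex flag is switched on.

```python
import re

# In CPython, re.UNICODE is an integer flag with the value 32.
flag_value = int(re.UNICODE)
print(flag_value)

# A made-up string with 39 separators, more than the accidental limit.
text = " ".join(str(i) for i in range(40))

# What re.split(pattern, text, re.UNICODE) effectively does:
# the flag value is interpreted as maxsplit, so at most 32 splits occur.
capped = re.split(r"\s", text, maxsplit=flag_value)

# Passing the flag by keyword leaves maxsplit at 0 (no limit).
full = re.split(r"\s", text, flags=re.UNICODE)

print(len(capped), len(full))
```

With short inputs such as 'perché.' the limit is never reached, which is why the mistake is easy to miss.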

For example:

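# -*- coding: utf-8 -*-
from __future__ import print_function
import re

def tokenize(string):
    # Split on runs of non-word characters (kept in the result by the
    # capturing group) unless the run ends with an apostrophe.
    # flags must be passed by keyword; the third positional slot is maxsplit.
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))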
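Why Unicode strings matter can be sketched in Python 3, where a bytes pattern matches with ASCII-only rules. This is only a rough analogy to matching a Python 2 byte string, not a reproduction of Python 2 behavior, and the helper name tokenize_bytes is hypothetical.

```python
import re

def tokenize_bytes(data):
    # Bytes pattern: \W uses ASCII rules, so the two UTF-8 bytes
    # that encode "é" both count as non-word characters.
    return re.split(rb"(\W+)(?<!')", data)

def tokenize_text(text):
    # str pattern: \W is Unicode-aware, so "é" is a word character.
    return re.split(r"(\W+)(?<!')", text, flags=re.UNICODE)

encoded = "perché.".encode("utf-8")
print(tokenize_bytes(encoded))
print(tokenize_text("perché."))
```

The bytes version breaks the word in front of the accented letter, while the text version keeps 'perché' whole.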

Can you try u'perché.' in 2.7? tokenize(u'perché.') -> Out[14]: [u'perch', u'\xe9.']. Same as before.

Running the script above, saved as test.py, under both interpreters gives:

C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
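One way to sidestep the positional-argument pitfall is to compile the pattern once, since the second argument of re.compile is flags. The following is a sketch rather than part of the original answer; the name TOKEN_SPLIT is hypothetical, and unicode_literals is assumed so that plain string literals are Unicode on Python 2 as well.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# re.compile(pattern, flags): here the flag cannot be mistaken for maxsplit.
TOKEN_SPLIT = re.compile(r"(\W+)(?<!')", re.UNICODE)

def tokenize(text):
    # The compiled pattern's split() takes only the string (and optional maxsplit).
    return TOKEN_SPLIT.split(text)

print(tokenize("perché."))
print(tokenize("l'acqua"))
```

The second call shows the purpose of the lookbehind: a run of non-word characters ending in an apostrophe is not used as a split point, so "l'acqua" stays in one piece.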