
Regular expressions and Unicode in Python 2.x vs 3.x


I have a simple function for tokenizing words:

import re
def tokenize(string):
    return re.split("(\W+)(?<!')",string,re.UNICODE)
In Python 3.5.0 I get the following:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
The problem is that 'é' should not be treated as a character to tokenize on. I thought re.UNICODE would be enough to make \W work the way I intended.


How can I get the same behavior in Python 2.x as in Python 3.x?

You probably want to use Unicode strings, but also note that the third positional argument of split is not flags, it is maxsplit:

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
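Since re.UNICODE is just an integer flag constant (its value is 32), the original call was effectively re.split(pattern, string, maxsplit=32): the flag was silently consumed as maxsplit and never reached the pattern compiler. A minimal sketch illustrating this, reusing the question's example:

#!coding:utf8
from __future__ import print_function
import re

# re.UNICODE (== re.U) has the integer value 32, so passing it as the
# third positional argument sets maxsplit=32 instead of enabling the flag.
assert int(re.UNICODE) == 32

# These two calls are therefore identical; the flag is never applied:
print(re.split(r"(\W+)(?<!')", u'perché.', re.UNICODE))
print(re.split(r"(\W+)(?<!')", u'perché.', maxsplit=32))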
For example:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')",string,flags=re.UNICODE)

print(tokenize(u'perché.'))
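An alternative that sidesteps the positional-argument trap is to compile the pattern once, since re.compile's second positional argument really is flags. A sketch (the WORD_SPLIT name is just illustrative):

#!coding:utf8
from __future__ import print_function
import re

# With re.compile the second positional argument IS flags, so it cannot
# be mistaken for maxsplit. WORD_SPLIT is an illustrative name.
WORD_SPLIT = re.compile(r"(\W+)(?<!')", re.UNICODE)

def tokenize(string):
    return WORD_SPLIT.split(string)

print(tokenize(u'perché.'))  # same tokens under 2.7 and 3.x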

Could you try u'perché.' under 2.7?

tokenize(u'perché.') -> Out[14]: [u'perch', u'\xe9.', u'']. Same as before.

With the flags keyword, though, the same script gives matching tokens under both interpreters:
#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')",string,flags=re.UNICODE)

print(tokenize(u'perché.'))
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
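For completeness, this also explains why the original (buggy) call appeared to work under Python 3: str patterns there match with Unicode semantics by default, so \W already excludes 'é' even though the flag was being swallowed as maxsplit. A sketch of that default, assuming Python 3:

import re

def tokenize(string):
    # In Python 3, str patterns use Unicode matching by default, so
    # re.UNICODE is redundant here (it is implied for str patterns).
    return re.split(r"(\W+)(?<!')", string)

print(tokenize('perché.'))  # ['perché', '.', '']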