Regular expressions and Unicode in Python 2.x vs 3.x
I have a simple function for tokenizing words:
import re
def tokenize(string):
    return re.split("(\W+)(?<!')",string,re.UNICODE)
In Python 3.5.0 I get the following:
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
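The problem is that "é" should not be treated as a character to split on. I thought re.UNICODE would be enough to make \W behave the way I intend.
How can I get the same behavior in Python 2.x as in Python 3.x?

You probably want to use Unicode strings, but there is a second issue: the third parameter of split is not flags, it is maxsplit: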
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
For example:
#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')",string,flags=re.UNICODE)
print(tokenize(u'perché.'))
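The maxsplit mix-up is easy to miss because short inputs never reach the limit. The following is a minimal sketch, assuming Python 3; the 40-word test string and the variable names are made up for illustration. It passes the numeric value of re.UNICODE as maxsplit by keyword, which is what the positional call in the question amounts to, and compares that with passing the flag as flags=.

```python
import re

PATTERN = r"(\W+)(?<!')"

# Hypothetical input: 40 one-letter words separated by 39 single spaces.
text = " ".join(["w"] * 40)

# re.UNICODE is an integer flag. A positional third argument to re.split
# is read as maxsplit, so this number would become the split limit.
as_maxsplit = int(re.UNICODE)

# What the question's call effectively does: the flag value caps the splits.
limited = re.split(PATTERN, text, maxsplit=as_maxsplit)

# What was intended: the flag is passed by keyword, with no split limit.
full = re.split(PATTERN, text, flags=re.UNICODE)

print(as_maxsplit)                      # 32
print(len(limited), repr(limited[-1]))  # trailing words stay unsplit
print(len(full))
```

With an input like 'perché.' there is only one separator, so the limit of 32 is never reached and the call appears to work; the flag is simply not applied.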
Can you try u'perché.' in 2.7? tokenize(u'perché.') -> Out[14]: [u'perch', u'\xe9.']. Same as before.
Running the tokenize script from the answer, saved as test.py, under both interpreters gives:
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']
C:\>py -3 test.py
['perché', '.', '']
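For readers who only have Python 3 at hand, the difference can be approximated there. This is a rough sketch under the assumption that re.ASCII on a str pattern, or a bytes pattern on UTF-8 data, behaves like \W without re.UNICODE in Python 2; the helper names tokenize_ascii and tokenize_bytes are hypothetical and not part of the answer.

```python
import re

def tokenize(string):
    # The corrected function from the answer: the flag is passed by keyword.
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

def tokenize_ascii(string):
    # Hypothetical helper: restrict \W to ASCII semantics, so "é" counts
    # as a non-word character, as it does when the flag is not applied.
    return re.split(r"(\W+)(?<!')", string, flags=re.ASCII)

def tokenize_bytes(data):
    # Hypothetical helper: a bytes pattern only knows ASCII word characters,
    # so every byte of a UTF-8 encoded "é" matches \W.
    return re.split(rb"(\W+)(?<!')", data)

word = "perch\u00e9."  # 'perché.' written with an escape for portability

print(tokenize(word))
print(tokenize_ascii(word))
print(tokenize_bytes(word.encode("utf-8")))
```

The first call keeps "é" inside the word, while the other two split it off together with the full stop, which matches the symptom described in the question.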