In Python, how do I list all characters matched by the POSIX extended regex [:space:]?

python regex unicode

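Is there a programmatic way to extract the Unicode code points covered by [[:space:]]?

This is going to be a bit difficult, because Python's built-in re module does not support POSIX character classes. The third-party regex module does, however (you have to install it yourself).

The only way I can think of to extract all Unicode characters matching [[:space:]] is a bit ugly:

1. Generate a string containing all Unicode characters.
2. Match it against [[:space:]].

I am sure there is a better way to generate stri (the string of all Unicode characters) in the code below, so there is room for improvement: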
chrs = [unichr(c) for c in range(0x10ffff+1)] # <-- eww that's not very fast!
# also we go up to 0x10ffff (inclusive) because that's what help(unichr) says.
stri = ''.join(chrs)

import re
# example if we wanted things matching `\s` with `re` module:
re.findall(r'\s', stri)
# --> [u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u' ']

# If i had the regex module...
# regex.findall("[[:space:]]",stri)
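
The two steps above can be sketched in Python 3 roughly as follows. This is a hedged sketch, not the original poster's code: chars_matching is a hypothetical helper name, it tests each code point individually instead of building one large string, and the regex part only runs if that third-party package happens to be installed.

```python
import re
import sys


def chars_matching(pattern):
    """Return every single character that the compiled pattern fully matches."""
    matched = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if pattern.fullmatch(ch):
            matched.append(ch)
    return matched


# Works with the standard library's re module ...
whitespace = chars_matching(re.compile(r"\s"))
print([hex(ord(ch)) for ch in whitespace])

# ... and, assuming it is installed (pip install regex), with a POSIX class.
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

if regex is not None:
    posix_space = chars_matching(regex.compile(r"[[:space:]]"))
    print([hex(ord(ch)) for ch in posix_space])
```

Scanning roughly 1.1 million code points one at a time is slow but simple; the answers below build a single string and call findall once instead.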


(Edit: renamed the variable from str to stri to avoid shadowing the built-in str (!))

Use a generator instead of a list, and xrange instead of range:
>>> s = u''.join(unichr(c) for c in xrange(0x10ffff+1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

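Ouch: on a narrow build the upper bound has to come from sys.maxunicode rather than being hard-coded. And what about "no-break space" and the like? Passing re.UNICODE makes \s match them too. What is all that stuff? unicodedata.name is your friend: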
>>> from unicodedata import name
>>> for c in re.findall(r'\s', s, re.UNICODE):
...     print repr(c), name(c, '')
...
u'\t'
u'\n'
u'\x0b'
u'\x0c'
u'\r'
u'\x1c'
u'\x1d'
u'\x1e'
u'\x1f'
u' ' SPACE
u'\x85'
u'\xa0' NO-BREAK SPACE
u'\u1680' OGHAM SPACE MARK
u'\u180e' MONGOLIAN VOWEL SEPARATOR
u'\u2000' EN QUAD
u'\u2001' EM QUAD
u'\u2002' EN SPACE
u'\u2003' EM SPACE
u'\u2004' THREE-PER-EM SPACE
u'\u2005' FOUR-PER-EM SPACE
u'\u2006' SIX-PER-EM SPACE
u'\u2007' FIGURE SPACE
u'\u2008' PUNCTUATION SPACE
u'\u2009' THIN SPACE
u'\u200a' HAIR SPACE
u'\u2028' LINE SEPARATOR
u'\u2029' PARAGRAPH SEPARATOR
u'\u202f' NARROW NO-BREAK SPACE
u'\u205f' MEDIUM MATHEMATICAL SPACE
u'\u3000' IDEOGRAPHIC SPACE

To update the answer for Python 3:
import re
import sys

s = ''.join(chr(c) for c in range(sys.maxunicode+1))
ws = ''.join(re.findall(r'\s', s))
>>> ws.isspace()
True

Here are the Unicode code points that were found:
>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

We can see that the str.strip() method treats all of these as whitespace:
>>> len(ws.strip())
0

Here is more information about the characters:
from unicodedata import name, category
for char in ws:
    print(hex(ord(char)), repr(char), category(char), name(char, None))

In Python 3.5, this prints for me:
0x9 '\t' Cc None
0xa '\n' Cc None
0xb '\x0b' Cc None
0xc '\x0c' Cc None
0xd '\r' Cc None
0x1c '\x1c' Cc None
0x1d '\x1d' Cc None
0x1e '\x1e' Cc None
0x1f '\x1f' Cc None
0x20 ' ' Zs SPACE
0x85 '\x85' Cc None
0xa0 '\xa0' Zs NO-BREAK SPACE
0x1680 '\u1680' Zs OGHAM SPACE MARK
0x2000 '\u2000' Zs EN QUAD
0x2001 '\u2001' Zs EM QUAD
0x2002 '\u2002' Zs EN SPACE
0x2003 '\u2003' Zs EM SPACE
0x2004 '\u2004' Zs THREE-PER-EM SPACE
0x2005 '\u2005' Zs FOUR-PER-EM SPACE
0x2006 '\u2006' Zs SIX-PER-EM SPACE
0x2007 '\u2007' Zs FIGURE SPACE
0x2008 '\u2008' Zs PUNCTUATION SPACE
0x2009 '\u2009' Zs THIN SPACE
0x200a '\u200a' Zs HAIR SPACE
0x2028 '\u2028' Zl LINE SEPARATOR
0x2029 '\u2029' Zp PARAGRAPH SEPARATOR
0x202f '\u202f' Zs NARROW NO-BREAK SPACE
0x205f '\u205f' Zs MEDIUM MATHEMATICAL SPACE
0x3000 '\u3000' Zs IDEOGRAPHIC SPACE
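
Since ws.isspace() and ws.strip() both agree with \s above, the same set can probably be collected without re at all. The following is a minimal Python 3 sketch of that check, not part of the original answer; the names all_chars, by_regex and by_isspace are made up for illustration. The re documentation describes \s on str patterns in terms of str.isspace(), so the two sets are expected to coincide, but it is worth verifying on the interpreter actually in use.

```python
import re
import sys

# One string holding every code point, as in the answer above.
all_chars = "".join(map(chr, range(sys.maxunicode + 1)))

# Characters matched by \s versus characters for which str.isspace() is true.
by_regex = set(re.findall(r"\s", all_chars))
by_isspace = {ch for ch in all_chars if ch.isspace()}

print(by_regex == by_isspace)
# Any code points on which the two approaches disagree (ideally none).
print(sorted(hex(ord(ch)) for ch in by_regex ^ by_isspace))
```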

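Are you using a specific module? \s matches [ \t\n\r\f\v]. What do you need this information for? If it is just out of curiosity, you can search the Unicode database for all characters with the White_Space property. Sadly, Python's unicodedata module does not provide a way to enumerate or iterate over a set of code points, and certainly not by property.

@Problemaniac, the github link is broken.

@BiGYaN I added the code explicitly.

So where do you want the range to end? The problem is that not all code points in that range are valid.

help(unichr) says unichr(i) is valid for 0 <= i <= 0x10ffff.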
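
Although unicodedata offers no direct way to enumerate code points by property, a brute-force workaround is to walk every code point and filter on what the module does expose. Below is a rough Python 3 sketch, assuming the general category (Zs, Zl, Zp) is an acceptable stand-in for the White_Space property. It is not an exact one: the Cc control characters such as '\t' and '\n' in the listings above count as whitespace but fall outside the Z categories. The name separators is illustrative only.

```python
import sys
import unicodedata

# Collect every code point whose general category is a separator (Zs, Zl, Zp).
separators = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("Z")
]

for ch in separators:
    # name() takes a default because some code points have no name.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
```

The exact result depends on the Unicode database version bundled with the interpreter, which unicodedata.unidata_version reports.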