In Python, how do I check whether a string contains only certain characters?

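I need to check that a string contains only a..z, 0..9, and . (period), and no other characters.

I could iterate over each character and check whether it is a..z, 0..9, or ., but that would be slow.

I'm not clear on how to do it with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach?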

#Valid chars . a-z 0-9 
def check(test_str):
    import re
    #http://docs.python.org/library/re.html
    #re.search returns None if no position in the string matches the pattern
    #pattern to search for any character other than . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        #Character other than . a-z 0-9 was found
        print('Invalid : %r' % (test_str,))
    else:
        #No character other than . a-z 0-9 was found
        print('Valid   : %r' % (test_str,))

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : 'abcde.1'
Invalid : 'abcde.1#'
Invalid : 'ABCDE.12'
Invalid : '_-/>"!@#12345abcde<'
'''
Here's a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).
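The code for this answer did not survive in this copy. A minimal sketch of what such a module could look like, assuming a hypothetical file named stringcheck.py (the name comes from the usage example) and the same set of allowed characters that appears in the timing snippets further down:

```python
# stringcheck.py (hypothetical module; a sketch, not the original code)
import string

# Allowed characters: lowercase ASCII letters, digits, and the period.
ALLOWED = frozenset(string.ascii_lowercase + string.digits + ".")


def check(test_str):
    """Return True if every character of test_str is in ALLOWED."""
    # set(test_str) holds the distinct characters; <= is the subset test.
    return set(test_str) <= ALLOWED
```

Note that an empty string passes this check, because the empty set is a subset of every set.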

Use it like this:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False

A simpler approach? A little more Pythonic:

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

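It certainly isn't the most efficient, but it is readable.

EDIT: Changed the regular expression to exclude A-Z.

The regular-expression solution is the fastest pure-Python solution so far: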

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902
Compared with the other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

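As requested, I'm restoring the other part of the answer. Note, however, that the following also accepts the A-Z range.

You can use isalnum after removing the periods.

EDIT: Using isalnum is much more efficient than the set solution: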

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766
EDIT 2: John gave an example where the above doesn't work. I changed the solution to handle this special case by using encode:

test_str.replace('.', '').encode('ascii', 'replace').isalnum()
And it is still almost 3 times faster than the set solution:

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318
In my opinion, using regular expressions is the best way to solve this problem.
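The isalnum snippets above are Python 2. As a rough Python 3 sketch of the same idea (the helper name loose_check is made up for illustration, and str.isascii() needs Python 3.7 or later): like the original, it also accepts A-Z, and it rejects a string with no letters or digits because ''.isalnum() is False.

```python
def loose_check(test_str):
    """Return True if test_str has only ASCII letters, digits and periods."""
    stripped = test_str.replace(".", "")
    # isascii() rules out characters such as '\u0131' or '\u0661', which
    # str.isalnum() on its own would accept.
    return stripped.isascii() and stripped.isalnum()
```

If uppercase letters must be rejected as well, this helper is not sufficient and one of the regular-expression or set-based checks is a better fit.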

The answer, wrapped up in a function, with an annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>
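Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True. This is puzzling (perhaps it is the cost of returning a MatchObject instead of None) and may warrant further investigation.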

==== Earlier text ====
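The [previously] accepted answer could use a few improvements:

(1) The presentation gives the appearance of being the result of an interactive Python session: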

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
but match() doesn't return True; it returns a match object (or None).

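(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^.

(3) You should foster the habit of using raw strings automatically, without thinking, for any re pattern.

(4) The backslash in front of the dot/period is redundant inside a character class.

(5) It is slower than the OP's code!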

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop
(6) It can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False
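
Point (6) comes down to $ also matching just before a trailing newline, whereas \Z matches only at the very end of the string. On Python 3.4 and later, re.fullmatch is another way to anchor at both ends. A small sketch comparing the three (the function names are illustrative only):

```python
import re

ALLOWED_RUN = re.compile(r'[a-z0-9.]+')


def check_dollar(s):
    # '$' also matches just before a final '\n', so '1234\n' slips through.
    return bool(re.match(r'[a-z0-9.]+$', s))


def check_end(s):
    # '\Z' matches only at the true end of the string.
    return bool(re.match(r'[a-z0-9.]+\Z', s))


def check_fullmatch(s):
    # fullmatch requires the whole string to match the pattern.
    return ALLOWED_RUN.fullmatch(s) is not None


for s in ('1234', '1234\n'):
    print(repr(s), check_dollar(s), check_end(s), check_fullmatch(s))
```

For '1234' all three return True; for '1234\n' only check_dollar does.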

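This has already been answered satisfactorily, but for people who come across it after the fact, I have profiled several different ways of doing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations: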

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    return not bool(re.search(r'[^A-F0-9]', input))
This gave the following results:

50335004 function calls in 13.428 seconds

Ordered by: cumulative time, function name
List reduced from 20 to 8 due to restriction

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

Where:

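ncalls: the number of times the function was called
tottime: the total time spent in the given function, excluding time spent in sub-functions
percall: the quotient of tottime divided by ncalls
cumtime: the cumulative time spent in this function and all sub-functions
percall (second column): the quotient of cumtime divided by primitive calls

The columns we actually care about are cumtime and its percall, since they show the actual time from function entry to exit.

It is faster not to bother compiling the regex at all than to compile it on every call. Compiling once is about 7.5% faster than compiling every time, but only about 2.5% faster than not compiling explicitly.

test_set was roughly twice as slow as the re.search variants and more than twice as slow as the re.match variants.

test_not_any was a full order of magnitude slower than test_set.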


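TL;DR: Use re.match or re.search.

Use Python sets when you need to compare, hm, sets of data. Strings can be represented as sets of characters quite fast. Here I test whether a string is an allowed phone number. The first string is allowed, the second is not. It works fast and is simple.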

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807
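
Note that the timed statements above rebuild the allowed set on every run. A sketch of the same check as a reusable function, with the set built once (the names PHONE_CHARS and is_phone_like are invented for this example):

```python
# Characters permitted in a phone number, built once at import time.
PHONE_CHARS = frozenset('0123456789+-() ')


def is_phone_like(text):
    """Return True if text contains only characters from PHONE_CHARS."""
    return set(text).issubset(PHONE_CHARS)


print(is_phone_like('+7(898) 64-901-63 '))     # only allowed characters
print(is_phone_like('+7(950) 64-901-63 фыв'))  # contains Cyrillic letters
```

This only checks which characters occur, not whether the string has the shape of a real phone number.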

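Don't use regexps if you can avoid them.

A different approach, because in my case I also needed to check whether the string contained certain words (like 'test' in this example), not just characters: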

input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']

for allowed_list_item in allowed_list:
    input_string_test = input_string_test.replace(allowed_list_item, '')

if not input_string_test:
    # test passed: nothing is left after removing every allowed item
    print('test passed')
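So the allowed strings (characters or words) are removed from the input string. If nothing is left, the input contained only allowed strings and the test passes.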
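
One caveat with repeated replace(): removing one allowed item can join the leftovers into another allowed item, so 'tcest' passes once the 'c' has been removed. If that matters, one alternative (a sketch, not part of the original answer; the name only_allowed is made up) is to build a regular expression from the allowed items and require a full match:

```python
import re

allowed_list = ['a', 'b', 'c', 'test', ' ']

# Escape each item so it is matched literally, then allow any number of
# repetitions of any item.
alternatives = '|'.join(re.escape(item) for item in allowed_list)
allowed_re = re.compile('(?:' + alternatives + ')*')


def only_allowed(text):
    """Return True if text is a concatenation of items from allowed_list."""
    return allowed_re.fullmatch(text) is not None
```

With this version 'abc test' is accepted, while 'tcest' is rejected because 't' on its own is not an allowed item.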

The profiling code for the hex-digit test implementations above, which produced the results table:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")
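
import re

def check(string):
    # fullmatch succeeds only if the entire string consists of allowed
    # characters; use * instead of + to accept the empty string as well
    pattern = r'[a-z0-9.]+'
    return re.fullmatch(pattern, string) is not None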