Python partial regular expression matching
I am asking about partial regular expression matching in Python. For example, if you have a string:
string = 'foo bar cat dog elephant barn yarn p n a'
and a regular expression:
pattern = r'foo bar cat barn yard p n a f'
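then the following is true: re.match(pattern, string) will return None, and re.search(pattern, string) will also return None, even though we can all see that the first part of the pattern matches the first part of the string.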
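A quick way to confirm both results, as a minimal check that uses only the standard re module and the values from the question:

```python
import re

string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat barn yard p n a f'

# match() anchors at the start of the string; search() scans every position.
# Both need the whole pattern to match somewhere, so both return None here.
match_result = re.match(pattern, string)
search_result = re.search(pattern, string)
print(match_result, search_result)
```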
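So, rather than searching for the whole pattern in the string, is there a way to see what percentage of the string matches the pattern?
As far as I know, no regex library can do this, but it is possible if you have access to the state machine and step through it one character at a time. Compiling a regular expression into a state machine is a little tricky, but running the state machine is simple, so you can step through it in whatever way you like.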
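To make the stepping idea concrete, here is a minimal sketch for the simplest possible case: a literal pattern with no metacharacters, where the state machine is just a chain of states, one per pattern character. The names Status and step_literal are made up for this illustration; a real regular expression would need a proper NFA or DFA, which this does not attempt.

```python
from enum import Enum


class Status(Enum):
    MATCHED = "matched"        # the whole pattern was consumed
    NEEDS_MORE = "needs more"  # input ran out; more input could still match
    CONFLICT = "conflict"      # a character contradicted the pattern


def step_literal(pattern, text):
    """Walk a literal pattern one input character at a time.

    The state is simply the number of pattern characters matched so far.
    Returns (status, state), where state is how far into the pattern we got.
    """
    state = 0
    for ch in text:
        if state == len(pattern):
            break  # pattern already fully matched; ignore trailing input
        if ch != pattern[state]:
            return Status.CONFLICT, state
        state += 1
    if state == len(pattern):
        return Status.MATCHED, state
    return Status.NEEDS_MORE, state


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = 'foo bar cat barn yard p n a f'  # treated as a literal, not a regex
status, consumed = step_literal(pattern, string)
print(status, consumed, len(pattern))
```

For the strings in the question this reports a conflict after 12 characters, that is, right after 'foo bar cat '.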
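Stepping through the machine can tell you the number of characters after which it switches from "might match, depending on future input" to "cannot match because of a conflict", but it cannot directly give you a percentage (though I do not think that is what you really want).
Without using regular expressions: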
from difflib import SequenceMatcher
SequenceMatcher(None, string, pattern).ratio()
# => 0.7536231884057971
You can even match on words instead of characters:
SequenceMatcher(None, string.split(), pattern.split()).ratio()
# => 0.7368421052631579
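If the goal is to see which parts lined up rather than a single ratio, SequenceMatcher.get_matching_blocks() exposes the matched runs. The sketch below reuses the strings from the question; expressing the result as a share of the pattern's words is only one possible reading of "percentage", and that reading is an assumption here.

```python
from difflib import SequenceMatcher

string = 'foo bar cat dog elephant barn yarn p n a'
pattern = 'foo bar cat barn yard p n a f'

text_words = string.split()
pattern_words = pattern.split()
matcher = SequenceMatcher(None, text_words, pattern_words)

# Each block is (a, b, size), meaning
# text_words[a:a + size] == pattern_words[b:b + size].
# The final block is a zero-size sentinel, so it is filtered out.
blocks = [blk for blk in matcher.get_matching_blocks() if blk.size]
matched_words = sum(blk.size for blk in blocks)
pattern_coverage = matched_words / len(pattern_words)

for blk in blocks:
    print(pattern_words[blk.b:blk.b + blk.size])
print(f"{matched_words} of {len(pattern_words)} pattern words matched")
```

Note that this compares sequences of tokens, not regular expressions, so metacharacters in the pattern have no special meaning.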
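Yes, partial regular expression matching is possible. I had been toying with the idea of partial matching and came across this question while searching. I found a way to do what I needed and thought I would post it here. This is no speed demon; it is probably only useful where speed is not a concern. This function finds the best partial match for a regular expression and returns the matched text: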
>>> import re
>>>
>>> def partial_match(regex, string, flags=0, op=re.match):
...     """
...     Matches a regular expression to a string incrementally, retaining the
...     best substring matched.
...     :param regex: The regular expression to apply to the string.
...     :param string: The target string.
...     :param flags: re module flags, e.g.: `re.I`
...     :param op: Either of re.match (default) or re.search.
...     :return: The substring of the best partial match.
...     """
...     # If the whole expression matches, there is nothing more to do.
...     m = op(regex, string, flags)
...     if m:
...         return m.group(0)
...     final = None
...     # Try ever-longer prefixes of the expression, keeping the last match.
...     for i in range(1, len(regex) + 1):
...         try:
...             m = op(regex[:i], string, flags)
...             if m:
...                 final = m.group(0)
...         except re.error:
...             # This prefix is not a valid expression on its own; skip it.
...             pass
...     return final
...
Testing it:
>>> partial_match(r".*l.*?iardvark", "bluebird")
'bluebi'
>>>
>>> partial_match(r"l.*?iardvark", "bluebird")
>>> # None was returned. Try again with search...
>>>
>>> partial_match(r"l.*?iardvark", "bluebird", op=re.search)
'luebi'
>>>
>>> string = 'foo bar cat dog elephant barn yarn p n a'
>>> pattern = r'foo bar cat barn yard p n a f'
>>>
>>> partial_match(pattern, string)
'foo bar cat '
>>>
>>> partial_match(r".* (zoo){1,3}ran away", "the fox at the "
...                                          "zoozoozoozoozoo is "
...                                          "happy")
'the fox at the zoozoozoo'
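It performs as expected. The algorithm keeps trying to match as much of the expression as possible against the target string. It continues until the entire expression has been tried against the target string, and it retains the best partial match.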
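If a number is wanted rather than the matched text, the same prefix loop can also report how much of the expression was used. The variant below is only a sketch: longest_matching_prefix is a hypothetical helper, not part of the function above, and measuring progress in pattern characters is a crude assumption, since a short prefix such as .* can already match everything.

```python
import re


def longest_matching_prefix(pattern, text, flags=0, op=re.match):
    """Return (prefix_length, matched_text) for the longest pattern prefix
    that is a valid expression and matches the text."""
    best_len, best_text = 0, None
    for i in range(1, len(pattern) + 1):
        try:
            m = op(pattern[:i], text, flags)
        except re.error:
            continue  # e.g. "(a" is not a valid expression on its own
        if m:
            best_len, best_text = i, m.group(0)
    return best_len, best_text


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat barn yard p n a f'
length, matched = longest_matching_prefix(pattern, string)
print(f"{length}/{len(pattern)} pattern characters used, matched {matched!r}")
```

Dividing length by len(pattern) gives a rough percentage for literal-heavy patterns like the one in the question; for patterns full of quantifiers the figure says little.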
>>> import regex
>>>
>>> regex.match(r"(?:a.b.c.d){d}", "a.b.c", regex.ENHANCEMATCH).group(0)
'a.b.c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car").group(0)
'moo c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car",
...             regex.ENHANCEMATCH).group(0)
...
'moo c'
>>> # ^^ the 'c' above is not what we want in the output. As you can see,
>>> # the 'fuzzy' matching is a bit different from partial matching.
>>>
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t)
... """
>>>
>>> cprof.run(regex_script)
57912 function calls (57835 primitive calls) in 0.180 seconds
...
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.ENHANCEMATCH)
... """
>>>
>>> cprof.run(regex_script)
57904 function calls (57827 primitive calls) in 0.298 seconds
OK. Now let's see just how slow it really is:
>>> import cProfile as cprof, random as rand, re
>>>
>>> # targets = ['lazy: that# fox~ The; little@ quick! lamb^ dog~ ',
>>> # << 999 more random strings of random length >>]
>>>
>>> words = """The; quick! brown? fox~ jumped, over. the! lazy: dog~
... Mary? had. a little- lamb, a& little@ lamb^ {was} she... and,,,
... [everywhere] that# Mary* went=, the. "lamb" was; sure() (to) be.
... """.split()
...
>>> targets = [' '.join(rand.choices(words, k=rand.randint(1, 100)))
...            for _ in range(1000)]
...
>>> exprs = ['.*?&', '.*(jumped|and|;)', '.{1,100}[\\.,;&#^]', '.*?!',
...          '.*?dog. .?lamb.?', '.*?@', 'lamb', 'Mary']
...
>>> partial_match_script = """
... for t in targets:
...     for e in exprs:
...         m = partial_match(e, t)
... """
...
>>> match_script = """
... for t in targets:
...     for e in exprs:
...         m = re.match(e, t)
... """
...
>>> cprof.run(match_script)
32003 function calls in 0.032 seconds
>>>
>>> cprof.run(partial_match_script)
261949 function calls (258167 primitive calls) in 0.230 seconds
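Without the regex.ENHANCEMATCH flag, performance is slightly better than the partial_match() solution. With the flag, however, it is slower.
A regular expression with the regex.BESTMATCH flag is probably the most similar in behavior to partial_match(), but it is slower still: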
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.BESTMATCH)
... """
>>> cprof.run(regex_script)
57912 function calls (57835 primitive calls) in 0.338 seconds
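regex also has a partial=True flag, but it does not seem to work the way we would expect at all.
The pattern must match in full. If you want part of it to be optional, use ?. See the Python documentation or the regex HOWTO. For example, pattern = r'foo bar cat (barn yard p n a f)?'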
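As a sketch of that suggestion using the strings from the question (the exact spacing inside the pattern is an assumption): making the tail an optional group lets re.match succeed on the leading part, and the group tells you whether the tail took part in the match.

```python
import re

string = 'foo bar cat dog elephant barn yarn p n a'
# The trailing group is optional, so the leading literal can match alone.
pattern = r'foo bar cat (barn yard p n a f)?'

m = re.match(pattern, string)
whole = m.group(0)  # the text that actually matched
tail = m.group(1)   # None when the optional group did not participate
print(repr(whole), tail)
```

This only reports whether the optional part matched as a whole; it does not say how far into that part the string got.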
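I know :). The question is not about a full search; it asks whether there is some other way to return a percentage rather than a match.
You could take a look at the regex module: it provides fuzzy matching.
Thanks, that is exactly what I was looking for! Great module, great feature.
Unfortunately, this only works for plain string patterns, not for regular expressions (i.e. ones with ?, |, (), etc.).
I want to know whether the string was consumed or whether matching failed because of a conflict. If so, how far into the pattern did the string get?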