Partial regular expression matching in Python

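I'm asking about partial regular expression matching in Python.

For example:

If you have a string:

string = 'foo bar cat dog elephant barn yarn p n a'
and a regular expression:

pattern = r'foo bar cat barn yard p n a f'
then the following is true:

  • re.match(pattern, string) will return None
  • re.search(pattern, string) will also return None
even though we can all see that the first part of the pattern matches the first part of the string.

So, instead of searching for the whole pattern in the string, is there a way to see what percentage of the string matches the pattern?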

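As far as I know, this isn't possible with any regex library, but it is possible if you have access to the state machine and can step through it one character at a time.

Compiling a regular expression to a state machine is a bit tricky, but running a state machine is trivial, so you can do whatever kind of stepping you want.

This can tell you the number of characters at which it switched from "could match depending on future input" to "doesn't match because of a conflict", but it won't directly give you a percentage (though I don't think that's really what you want).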
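
To make the stepping idea concrete, here is a minimal sketch. It assumes the pattern is a plain literal (no `|`, `()`, `*` and so on), because a literal compiles trivially into a linear chain of states; a real regular expression would need a proper NFA/DFA construction. The names `literal_to_states` and `walk` are made up for this illustration and do not come from any library.

```python
def literal_to_states(pattern):
    """Compile a literal pattern into a linear chain of states.

    State i means "the first i characters of the pattern have matched";
    the only transition out of state i is on the character pattern[i].
    """
    return [{ch: i + 1} for i, ch in enumerate(pattern)]


def walk(states, text):
    """Step through `text` one character at a time.

    Returns (chars_consumed, status), where status is 'matched',
    'could still match' (ran out of input without a conflict) or 'failed'.
    """
    state = 0
    for pos, ch in enumerate(text):
        if state == len(states):      # accepting state reached
            return pos, "matched"
        nxt = states[state].get(ch)
        if nxt is None:               # no transition: conflict
            return pos, "failed"
        state = nxt
    if state == len(states):
        return len(text), "matched"
    return len(text), "could still match"


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = 'foo bar cat barn yard p n a f'

result = walk(literal_to_states(pattern), string)
print(result)  # (12, 'failed'): the conflict is at index 12, 'd' vs 'b'
```

The first element of the result is how far the input got before the conflict, which is the number the answer describes.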

Without using regular expressions:

from difflib import SequenceMatcher
SequenceMatcher(None, string, pattern).ratio()
# => 0.7536231884057971
You can even match words rather than characters:

SequenceMatcher(None, string.split(), pattern.split()).ratio()
# => 0.7368421052631579
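
If you also want to see which words were counted as matching, `SequenceMatcher.get_matching_blocks()` lists the aligned runs. The following is a small sketch built on the word-level comparison above; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

string = 'foo bar cat dog elephant barn yarn p n a'
pattern = 'foo bar cat barn yard p n a f'

a, b = string.split(), pattern.split()
sm = SequenceMatcher(None, a, b)

# Collect the words of `string` that fall inside a matching block.
matched = []
for block in sm.get_matching_blocks():
    matched.extend(a[block.a:block.a + block.size])

print(f"{sm.ratio():.0%} similar")
print(matched)
```

Note that `ratio()` is computed as 2 * (matched elements) / (total elements in both sequences), so it measures similarity between the two sequences, not how much of a regular expression was consumed.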

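Yes, partial regex matching is possible.

I had been playing around with the idea of partial matching and found this question during my search. I found a way to do what I needed and thought I'd post it here.

It's no speed demon. It is probably only useful where speed isn't an issue.

This function finds the best partial match for a regular expression and returns the matched text.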

>>> import re
>>>
>>> def partial_match(regex, string, flags=0, op=re.match):
...     """
...     Matches a regular expression to a string incrementally, retaining the
...     best substring matched.
...     :param regex:   The regular expression to apply to the string.
...     :param string:  The target string.
...     :param flags:   re module flags, e.g.: `re.I`
...     :param op:      Either of re.match (default) or re.search.
...     :return:        The substring of the best partial match.
...     """
...     m = op(regex, string, flags)
...     if m:
...         return m.group(0)
...     final = None
...     for i in range(1, len(regex) + 1):
...         try:
...             m = op(regex[:i], string, flags)
...             if m:
...                 final = m.group(0)
...         except re.error:
...             pass
...     return final
...     
Testing it:

>>> partial_match(r".*l.*?iardvark", "bluebird")
'bluebi'
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird")
>>> # None was returned. Try again with search...
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird", op=re.search)
'luebi'
>>>
>>> string = 'foo bar cat dog elephant barn yarn p n a'
>>> pattern = r'foo bar cat barn yard p n a f'
>>> 
>>> partial_match(pattern, string)
'foo bar cat '
>>> 
>>> partial_match(r".* (zoo){1,3}ran away", "the fox at the "
...                                         "zoozoozoozoozoo is "
...                                         "happy")
'the fox at the zoozoozoo'
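It behaves as expected. The algorithm keeps trying to match as much of the expression as it can against the target string. It continues until the whole expression has been tried against the target string, retaining the best partial match.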
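
Since the question asks for a percentage, the same prefix-by-prefix idea can also report how much of the pattern was consumed. This is only a sketch under one assumption: progress is measured in pattern characters, which is crude for expressions full of metacharacters. The name `matched_pattern_fraction` is made up for this example.

```python
import re


def matched_pattern_fraction(pattern, string, flags=0, op=re.match):
    """Return (fraction, text) for the longest pattern prefix that matches.

    `fraction` is len(prefix) / len(pattern); `text` is what that prefix
    matched in `string` (None if no prefix matched at all).
    """
    if not pattern:
        return 0.0, None
    best_len, best_text = 0, None
    for i in range(1, len(pattern) + 1):
        try:
            m = op(pattern[:i], string, flags)
        except re.error:
            continue  # this prefix is not a valid regex on its own
        if m:
            best_len, best_text = i, m.group(0)
    return best_len / len(pattern), best_text


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat barn yard p n a f'

fraction, text = matched_pattern_fraction(pattern, string)
print(f"{fraction:.0%} of the pattern matched {text!r}")
```

For the strings from the question, 12 of the 29 pattern characters are consumed before the first conflict.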

Alright. Now let's see how slow it really is:

>>> import cProfile as cprof, random as rand, re
>>>
>>> # targets = ['lazy: that# fox~ The; little@ quick! lamb^ dog~ ',
>>> #            << 999 more random strings of random length >>]
>>>
>>> words = """The; quick! brown? fox~ jumped, over. the! lazy: dog~
...            Mary? had. a little- lamb, a& little@ lamb^ {was} she... and,,, 
...            [everywhere] that# Mary* went=, the. "lamb" was; sure() (to) be.
...         """.split()
...
>>> targets = [' '.join(rand.choices(words, k=rand.randint(1, 100))) 
...            for _ in range(1000)]
...
>>> exprs   = ['.*?&', '.*(jumped|and|;)', '.{1,100}[\\.,;&#^]', '.*?!', 
...            '.*?dog. .?lamb.?', '.*?@', 'lamb', 'Mary']
...
>>> partial_match_script = """
... for t in targets:
...     for e in exprs:
...         m = partial_match(e, t)
...         """
...
>>> match_script = """
... for t in targets:
...     for e in exprs:
...         m = re.match(e, t)
...         """
... 
>>> cprof.run(match_script)
         32003 function calls in 0.032 seconds
>>>
>>> cprof.run(partial_match_script)
         261949 function calls (258167 primitive calls) in 0.230 seconds

For comparison, fuzzy matching with the regex module:

>>> import regex
>>>
>>> regex.match(r"(?:a.b.c.d){d}", "a.b.c", regex.ENHANCEMATCH).group(0)
'a.b.c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car").group(0)
'moo c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car", 
...             regex.ENHANCEMATCH).group(0)
...
'moo c'
>>> # ^^ the 'c' above is not what we want in the output. As you can see,
>>> # the 'fuzzy' matching is a bit different from partial matching.
>>>
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t)
...         """
>>>
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.180 seconds
...
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.ENHANCEMATCH)
...         """
>>> 
>>> cprof.run(regex_script)
         57904 function calls (57827 primitive calls) in 0.298 seconds

Without the regex.ENHANCEMATCH flag, performance is slightly better than the partial_match() solution. With the flag, however, it is slower.

regex with the regex.BESTMATCH flag is probably the closest in behavior to partial_match(), but it is slower still:

>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.BESTMATCH)
...         """
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.338 seconds

regex also has a partial=True flag, but it doesn't seem to work the way we'd expect at all.

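The pattern must match in its entirety. If you want part of it to be optional, mark that part as optional. Have a look at the Python docs or the regex HOWTO. For example,
pattern = r'foo bar cat( barn yard p n a f)?'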
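
A quick sketch of what that suggestion does with the strings from the question. It assumes the intended group is the whole tail of the original pattern, including its leading space.

```python
import re

string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat( barn yard p n a f)?'

m = re.match(pattern, string)
print(m.group(0))  # the text that matched
print(m.group(1))  # None when the optional tail is absent
```

This yields a match for the leading part, but it only says whether the optional group as a whole was present, not how far into it the string got.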
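I know :). The question isn't about a full search; it asks whether there is some other way to get back a percentage rather than a match.

You could eventually look at the regex module: it provides fuzzy matching.

Thanks, this is exactly what I was looking for! Great module, great features.

Unfortunately, this only works for string patterns, not for regular expressions (i.e. ones with |, (), etc.).

I want to know whether it was consumed or failed because of a conflict. If so, how far into the pattern did the string get?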