Partial regular expression matching in Python


I am asking about partial regular expression matching in Python.

For example, if you have a string:

string = 'foo bar cat dog elephant barn yarn p n a'

and a regular expression:

pattern = r'foo bar cat barn yard p n a f'

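Then the following is true: `re.match(pattern, string)` returns `None`, and `re.search(pattern, string)` also returns `None`, even though we can all see that the first part of the pattern matches the first part of the string.

So, rather than searching for the whole pattern in the string, is there a way to see what percentage of the string matched the pattern?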

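As far as I know this is not possible with any regex library, but it is possible if you have access to the state machine and step through it one character at a time.

Compiling a regular expression into a state machine is a bit tricky, but running the state machine is trivial, so you can do any kind of stepping you want.

This can tell you after how many characters it switched from "might match, depending on future input" to "cannot match because of a conflict", but it won't directly give you a percentage (though I don't think that is what you really want).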
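To make the stepping idea concrete, here is a minimal sketch under a strong simplifying assumption: the pattern is treated as literal text, so the "state machine" is just a counter of how many pattern characters have been seen. The function name `step_literal` and the status strings are made up for illustration; a real regex engine would need a proper NFA/DFA in place of the counter.

```python
def step_literal(pattern, text):
    """Feed `text` one character at a time to a state machine for a LITERAL pattern.

    State i means "the first i characters of the pattern have been seen".
    Returns (characters_consumed, status).
    """
    state = 0
    for ch in text:
        if state == len(pattern):
            return state, "matched"      # whole pattern already seen
        if ch != pattern[state]:
            return state, "conflict"     # this character rules out a match
        state += 1
    status = "matched" if state == len(pattern) else "could still match"
    return state, status


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = 'foo bar cat barn yard p n a f'
print(step_literal(pattern, string))  # (12, 'conflict')
```

The first element of the result is the number of characters consumed before the machine went from "could still match" to "conflict", which is the quantity described above.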

Not using regular expressions:

from difflib import SequenceMatcher
SequenceMatcher(None, string, pattern).ratio()
# => 0.7536231884057971

You can even match on words instead of characters:

SequenceMatcher(None, string.split(), pattern.split()).ratio()
# => 0.7368421052631579
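If a percentage of the pattern is wanted instead of the symmetric `ratio()`, one option is to add up the matching blocks yourself. This is only a sketch of that idea, reusing the question's `string` and `pattern`; dividing by the number of pattern words is an assumption about what "percentage" should mean here.

```python
from difflib import SequenceMatcher

string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat barn yard p n a f'

s_words, p_words = string.split(), pattern.split()
matcher = SequenceMatcher(None, s_words, p_words)

# Each block means s_words[a:a+size] == p_words[b:b+size];
# the final block is a zero-size sentinel and adds nothing to the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())

print(f"{matched} of {len(p_words)} pattern words matched "
      f"({matched / len(p_words):.0%})")
```

Here the matched words are `foo bar cat`, `barn` and `p n a`, so 7 of the 9 pattern words are accounted for.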

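Yes, partial regular expression matching is possible

I had been playing with the idea of partial matching and came across this question while searching. I found a way to do what I needed and thought I would post it here.

It's no speed demon. It is probably only useful where speed is not a concern.

This function finds the best partial match for a regular expression and returns the matched text: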

>>> import re
>>> def partial_match(regex, string, flags=0, op=re.match):
...     """
...     Matches a regular expression to a string incrementally, retaining the
...     best substring matched.
...     :param regex:   The regular expression to apply to the string.
...     :param string:  The target string.
...     :param flags:   re module flags, e.g.: `re.I`
...     :param op:      Either of re.match (default) or re.search.
...     :return:        The substring of the best partial match.
...     """
...     m = op(regex, string, flags)
...     if m:
...         return m.group(0)
...     final = None
...     for i in range(1, len(regex) + 1):
...         try:
...             m = op(regex[:i], string, flags)
...             if m:
...                 final = m.group(0)
...         except re.error:
...             pass
...     return final
...

Testing it:

>>> partial_match(r".*l.*?iardvark", "bluebird")
'bluebi'
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird")
>>> # None was returned. Try again with search...
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird", op=re.search)
'luebi'
>>>
>>> string = 'foo bar cat dog elephant barn yarn p n a'
>>> pattern = r'foo bar cat barn yard p n a f'
>>> 
>>> partial_match(pattern, string)
'foo bar cat '
>>> 
>>> partial_match(r".* (zoo){1,3}ran away", "the fox at the "
...                                         "zoozoozoozoozoo is "
...                                         "happy")
'the fox at the zoozoozoo'

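Performs as expected. The algorithm keeps trying to match as much of the expression as it can to the target string. It carries on until the whole expression has been tried against the target string, retaining the best partial match.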
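Since the question asks for a percentage, the same prefix loop can also report how much of the expression was used. The following is a sketch, not part of the original function: `longest_matching_prefix` is a hypothetical name, and measuring progress in characters of the regex source is an assumption (a truncated prefix such as `(zoo){1` can compile but mean something different from the full expression).

```python
import re


def longest_matching_prefix(pattern, text, op=re.match):
    """Return (matched_text, prefix_length) for the longest prefix of
    `pattern` that compiles and matches `text`, or (None, 0) if none does."""
    best_text, best_len = None, 0
    for i in range(1, len(pattern) + 1):
        try:
            m = op(pattern[:i], text)
        except re.error:
            continue  # this prefix is not a valid regex on its own
        if m:
            best_text, best_len = m.group(0), i
    return best_text, best_len


string = 'foo bar cat dog elephant barn yarn p n a'
pattern = r'foo bar cat barn yard p n a f'
text, used = longest_matching_prefix(pattern, string)
print(repr(text), used, f"{used / len(pattern):.0%}")  # 'foo bar cat ' 12 41%
```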

Okay. Now let's see how slow partial_match() really is:

>>> import cProfile as cprof, random as rand, re
>>>
>>> # targets = ['lazy: that# fox~ The; little@ quick! lamb^ dog~ ',
>>> #            << 999 more random strings of random length >>]
>>>
>>> words = """The; quick! brown? fox~ jumped, over. the! lazy: dog~
...            Mary? had. a little- lamb, a& little@ lamb^ {was} she... and,,, 
...            [everywhere] that# Mary* went=, the. "lamb" was; sure() (to) be.
...         """.split()
>>>
>>> targets = [' '.join(rand.choices(words, k=rand.randint(1, 100))) 
...            for _ in range(1000)]
>>>
>>> exprs   = ['.*?&', '.*(jumped|and|;)', '.{1,100}[\\.,;&#^]', '.*?!', 
...            '.*?dog. .?lamb.?', '.*?@', 'lamb', 'Mary']
>>>
>>> partial_match_script = """
... for t in targets:
...     for e in exprs:
...         m = partial_match(e, t)
...         """
>>>
>>> match_script = """
... for t in targets:
...     for e in exprs:
...         m = re.match(e, t)
...         """
>>>
>>> cprof.run(match_script)
         32003 function calls in 0.032 seconds
>>>
>>> cprof.run(partial_match_script)
         261949 function calls (258167 primitive calls) in 0.230 seconds

The third-party regex module offers fuzzy matching, which behaves in a somewhat similar way:

>>> import regex
>>>
>>> regex.match(r"(?:a.b.c.d){d}", "a.b.c", regex.ENHANCEMATCH).group(0)
'a.b.c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car").group(0)
'moo c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car", 
...             regex.ENHANCEMATCH).group(0)
'moo c'
>>> # ^^ the 'c' above is not what we want in the output. As you can see,
>>> # the 'fuzzy' matching is a bit different from partial matching.
>>>
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t)
...         """
>>>
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.180 seconds
...
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.ENHANCEMATCH)
...         """
>>> 
>>> cprof.run(regex_script)
         57904 function calls (57827 primitive calls) in 0.298 seconds

Without the regex.ENHANCEMATCH flag, performance is slightly better than the partial_match() solution (0.180 s versus 0.230 s). With the flag, however, it is slower (0.298 s).

A regular expression with the regex.BESTMATCH flag is probably the most similar in behaviour to partial_match(), but it is slower still:

>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.BESTMATCH)
...         """
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.338 seconds

regex also has a partial=True flag, but it doesn't seem to work the way we would expect at all.

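The pattern has to match completely. If you want part of it to be optional, use ?. Take a look at the Python documentation or the HOWTO. For example:

pattern = r'foo bar cat( barn yard p n a f)?'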
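Taking the optional-group suggestion one step further, every later word can be wrapped in its own nested optional group, so that `re.match` returns the longest leading run of words that is present. This is only a sketch: `nested_optional` is a hypothetical helper, and it assumes the pattern is a space-separated list of literal words.

```python
import re


def nested_optional(words):
    """Build w1(?: w2(?: w3)?)? so each later word is optional."""
    pattern = ""
    for word in reversed(words):
        tail = f"(?: {pattern})?" if pattern else ""
        pattern = re.escape(word) + tail
    return pattern


string = 'foo bar cat dog elephant barn yarn p n a'
words = 'foo bar cat barn yard p n a f'.split()

nested = nested_optional(words)
m = re.match(nested, string)
print(m.group(0))  # foo bar cat
```

Because the groups are nested instead of sequential, matching stops at the first missing word and never skips ahead to later ones.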

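I know :). The question is not about a full search; it asks whether there is another way to return a percentage instead of a match.

You could eventually look at the regex module: it provides fuzzy matching.

Thanks, this is exactly what I wanted! Great module, great features.

Unfortunately, this only works for plain string patterns, not for regular expressions (i.e. ones with ?, () and so on).

I want to know whether it was consumed or failed because of a conflict. If so, how far into the pattern did the string get?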