Parsing a stream with Python regular expressions

python regex stream

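Is there any way to use regular expression matching on a stream in Python?

I don't want to do this by getting the value of the whole string. I want to know whether there is any way to match a regex against a stream on the fly.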

Yes, use the getvalue method:

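import io  # on Python 2, cStringIO.StringIO plays the same role
import re

data = io.StringIO("some text")
regex = re.compile(r"\w+")
# getvalue() returns the whole buffer as one string, so nothing is streamed here
regex.match(data.getvalue())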

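I have the same problem. My first idea was to implement a LazyString class that behaves like a string but reads only as much data from the stream as is currently needed. I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed so far.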

This does not work (I got a "TypeError: expected string or buffer" from re.match), so I took a closer look at the implementation of the re module in the standard library.

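Unfortunately, using regular expressions on a stream does not seem to be possible. The core of the module is implemented in C, and that implementation expects the whole input to be in memory at once (mainly for performance reasons, I guess). There seems to be no easy way around this.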
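The failure described above can be reproduced with a short sketch. The LazyString class below is a hypothetical reconstruction, not the answerer's actual code: it assumes a text stream, supports only integer indexing, and reads one character at a time. The helper try_match is also just an illustrative name. On Python 3 the message reads "expected string or bytes-like object" rather than "expected string or buffer".

```python
import io
import re


class LazyString:
    """Hypothetical string-like wrapper that reads from a stream on demand."""

    def __init__(self, stream):
        self._stream = stream
        self._buffer = ""

    def _fill(self, upto):
        # Read until the buffer covers index `upto` or the stream runs dry.
        while len(self._buffer) <= upto:
            chunk = self._stream.read(1)
            if not chunk:
                break
            self._buffer += chunk

    def __getitem__(self, index):
        # Integer indexes only; raises IndexError past the end of the stream.
        self._fill(index)
        return self._buffer[index]

    def __iter__(self):
        i = 0
        while True:
            self._fill(i)
            if i >= len(self._buffer):
                return
            yield self._buffer[i]
            i += 1


def try_match(pattern, candidate):
    """Return the match, or the TypeError message if re rejects the input."""
    try:
        return re.match(pattern, candidate)
    except TypeError as exc:
        return "TypeError: %s" % exc


lazy = LazyString(io.StringIO("some text"))
print(lazy[3])                    # indexing works on the wrapper itself
print(try_match(r"\w+", lazy))    # re refuses anything that is not str or bytes-like
```

The wrapper behaves like a sequence from Python's point of view, but the C matcher asks for a real string or buffer, so duck typing is not enough.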

I also looked at Python Lex/Yacc, but its lexer uses re internally, so that does not solve the problem.

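One possibility would be to use a tool with a Python backend. It builds the lexer in pure Python code and seems to be able to operate on input streams. Since this problem is not that important to me (I don't expect my input to be very large), I probably won't investigate further, but it might be worth a look.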

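This seems to be an old problem. As I have posted elsewhere, you may want to subclass the Matcher class of my solution and perform the regex matching in the buffer. See kmp_example.py for a template. If it turns out that classic Knuth-Morris-Pratt matching is all you need, then your problem is solved right now by this small open source library :-)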
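For readers who only need fixed-string search, classic Knuth-Morris-Pratt matching over a stream can be sketched as follows. This is an independent illustration, not the library mentioned above: the names kmp_failure_table and kmp_search_stream are made up for this example, and the stream is assumed to arrive as an iterable of string (or bytes) chunks. KMP never backs up in the input, so only the failure table and one counter are kept between chunks.

```python
def kmp_failure_table(needle):
    """fail[i] is the length of the longest proper prefix of needle[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search_stream(needle, chunks):
    """Yield the absolute start offset of every (possibly overlapping)
    occurrence of `needle` in the concatenation of `chunks`."""
    if not needle:
        raise ValueError("needle must not be empty")
    fail = kmp_failure_table(needle)
    matched = 0  # number of needle characters matched so far
    offset = 0   # absolute position of the current character in the stream
    for chunk in chunks:
        for ch in chunk:
            while matched and ch != needle[matched]:
                matched = fail[matched - 1]
            if ch == needle[matched]:
                matched += 1
            if matched == len(needle):
                yield offset - len(needle) + 1
                matched = fail[matched - 1]
            offset += 1


# Occurrences may straddle chunk boundaries.
print(list(kmp_search_stream("abab", ["xxab", "aba", "b"])))
```

Because nothing but the match state survives from one chunk to the next, memory use is independent of the stream length.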

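In the specific case of a file, if you can memory-map the file and you are working with bytestrings rather than Unicode, you can feed the memory-mapped file to re as if it were a bytestring, and it will just work. This is limited by your address space, not by your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.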

If you can do this, it is a really nice option. If you can't, you have to fall back on messier options.
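A minimal sketch of the memory-mapping approach, using the standard mmap module. The function name search_file_with_mmap and the sample file contents are made up for illustration. It assumes a bytes pattern and a non-empty file, since mmap raises ValueError when asked to map an empty one.

```python
import mmap
import os
import re
import tempfile


def search_file_with_mmap(path, pattern):
    """Return re.findall() results for a bytes pattern over the file at `path`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re accepts the mmap object because it supports the buffer protocol.
            return re.findall(pattern, mm)


# Usage: write a small sample file, then search it without calling read().
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(b"alpha 123 beta 4567\n")
    print(search_file_with_mmap(sample_path, rb"\d+"))
finally:
    os.remove(sample_path)
```

Note that re.findall returns the groups rather than the whole match when the pattern contains capturing groups, exactly as it does for ordinary bytes input.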

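The third-party regex module (not re) offers partial match support, which can be used to build streaming support, but it is messy and comes with plenty of caveats. Features such as lookbehinds will not work, zero-width matches are tricky to get right, and I don't know whether it interacts correctly with the other advanced features that regex offers and re does not. Still, it seems to be the closest thing to a complete solution.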

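If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report text that could become a match if the data were extended: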

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

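That is the same as giving it a string. I was wondering whether there is any way to parse a stream.

Stream parsing runs contrary to the idea of regular expressions.

@SlientGhost: Not necessarily. You might want to parse some (infinite) stream with regular expressions, always matching at the current beginning of the stream and returning the matches as an iterator (consuming only the characters of the stream that were matched).

@MartinStettner: You could, if it were an automata-theoretic matcher without backreferences (and a few other things, such as lookahead constraints). As long as the RE can be compiled to a single finite automaton (NFA or DFA), it can match in one pass and so can handle spotting matches in an infinite stream. (But Python uses PCRE, which is not automata-theoretic and needs all the bytes up front.)

@DonalFellows: I looked and found no indication that the PCRE algorithm is not based on automata theory. To implement backrefs and lookaheads you would of course need to maintain an internal buffer, but that would not prevent some mechanism, say some kind of needmore callback, from working. (In many cases the buffer would not need to be very large compared with the size of a potentially infinite stream.)

@MartinStettner: It's one of those things that some people "just know". Stack-based matchers can support a richer language (that is the part you really know) but need a token stream in which they can back up. (I guess this is a consequence of having studied this stuff as a CS undergraduate.)

Well researched and interesting. Maybe this is a reasonable alternative?

I just found another solution: pexpect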

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2