Parsing a large file in Python with re

python regex file

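How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't help, because their content can't be converted to some kind of lazy string. The re module only supports strings as the content argument.

The C++ code below does what I want, using Boost: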

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

It depends on what kind of parsing you are doing.
If the parsing is line-by-line, you can iterate over the lines of the file with:
with open("/some/path") as f:
    for line in f:
        parse(line)

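Otherwise, you need something like chunking: read one chunk at a time and parse it. This obviously requires extra care, because whatever you are trying to match may straddle a chunk boundary.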
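The chunk-boundary problem mentioned above can be handled by carrying the unfinished tail of one chunk over into the next. The following is only a minimal sketch under a stated assumption: no match is longer than `max_match_len` bytes. The function name `iter_matches` and both size parameters are hypothetical, not part of any library. Patterns that depend on context before the match (lookbehind, `^`, `\b`) may behave differently at the point where the buffer is cut.

```python
import re

def iter_matches(path, pattern, chunk_size=1 << 20, max_match_len=4096):
    """Yield regex matches from a file that is read in fixed-size chunks.

    Assumption: no match is longer than max_match_len bytes.
    """
    regex = re.compile(pattern)  # pattern must be bytes, e.g. rb"..."
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # A match that starts before `safe` cannot run past the buffer end.
            safe = len(buf) if at_eof else max(0, len(buf) - max_match_len)
            consumed = 0
            for m in regex.finditer(buf):
                if m.start() >= safe:
                    break  # may continue in the next chunk; retry later
                yield m.group(0)
                consumed = m.end()
            if at_eof:
                return
            # Drop what is settled; keep the tail that may hold a partial match.
            buf = buf[max(consumed, safe):]
```

Memory use stays bounded by roughly `chunk_size + max_match_len`, at the cost of rescanning the retained tail on each iteration.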
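To elaborate on Julian's solution, you can implement chunking (if you want to run multi-line regexes) by storing and concatenating consecutive lines, like this: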
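from collections import deque

N = 3  # number of lines per window (example value)
window = deque(maxlen=N)  # the oldest line is dropped automatically
with open("/some/path") as f:
    for line in f:
        window.append(line)
        if len(window) == N:
            # Lines keep their trailing newline, so join without a separator.
            parse("".join(window))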

This keeps a running list of the previous N lines (including the current one) and parses that multi-line group as a single string.
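Everything works as it should now (the Python 3.2.3 interface differs from Python 2.7 in a few places). The search pattern just needs the b prefix, so that it is a bytes pattern, to get a working solution (in Python 3.2.3).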
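Unless you need multi-line regexes, parse the file line by line. If you rephrase what you have and what you want to achieve, we may have a better chance of suggesting something, unless you insist on a specific approach.
Thank you. I am searching for patterns in a stream, without regard to line boundaries.
Yes, but I don't know how many lines are needed (in the general case); in effect this is just a sub-case of reading the whole file into memory. I would prefer a general solution based on memory-mapped files (easy to use and efficient).
This is nice because it allows multi-line regexes.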
import re
import mmap
import pprint

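def ParseFile(fileName):
    # Open in binary mode: mmap exposes bytes, so the pattern must be bytes too.
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file; the OS loads pages on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            for item in re.finditer(b"\\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))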

if __name__ == "__main__":
    ParseFile("testre")
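
As a follow-up to the point about the b prefix and multi-line regexes, here is a small sketch of a reusable helper built on the same mmap approach. The name `find_all_mapped` is hypothetical. It relies on two behaviours of Python 3: a `str` pattern applied to an mmap object raises `TypeError`, and `re.DOTALL` lets `.` cross line boundaries so a single match can span several lines.

```python
import mmap
import re

def find_all_mapped(path, pattern, flags=0):
    """Return all matches of a bytes pattern in a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # group(0) copies the matched bytes out of the mapping, so the
            # results stay valid after the mapping is closed.
            return [match.group(0) for match in re.finditer(pattern, m, flags)]
```

For example, `find_all_mapped("BigFile.log", rb"BEGIN\n.*?END\n", re.DOTALL)` would collect every block delimited by the (made-up) BEGIN and END marker lines, however many lines each block spans.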