Python: Regex vs. readline for text processing


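I have some text to process (router output) and want to build a useful data structure from it: a dictionary with interface names as keys and packet counts as values. I have two methods that accomplish the same task. I would like to know which one I should use for efficiency, and which one is more likely to fail on larger data samples.

ReadLine1 takes the list of lines (from splitlines()), walks through it, and writes each interface name into the dictionary as a key, with the next three items as its value.

Readline2 uses the re module, matches each block against a pattern, and writes the dictionary keys and values from the captured groups.

The input to these functions, self.output, looks like this: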

message = """
Interface 1/1\n\t
    input : 1234\n\t
    output : 3456\n\t
    dropped : 12\n
\n
Interface 1/2\n\t
    input : 7123\n\t
    output : 2345\n\t
    dropped : 31\n\t
"""

def ReadLine1(self):
    lines = self.output.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            valuelist = []
            for i in [1,2,3]:
                valuelist.append((lines[index+i].split(':'))[1].strip())
            self.IFlist[line.split()[1]] = valuelist
    return self.IFlist

def Readline2(self):
    #print repr(self.output)
    n = re.compile(r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)",re.MULTILINE|re.DOTALL)
    blocks = self.output.split('\n\n')
    for block in blocks:
        m_object = re.match(n, block)
        self.IFlist[m_object.group(1)] = [m_object.group(i) for i in (2,3,4)]


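Both of your methods rely on specific aspects of the format to do the parsing, and either one can break if that format is changed or corrupted.

For example, if a space is added to the blank line between two entries (which you can't see), then blocks = self.output.split('\n\n') will not find two consecutive newlines, and the regex version will miss the second entry: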

{'1/1': ['1234', '3456', '12']}
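The failure described above can be reproduced with a small standalone sketch. It assumes the sample text is a plain multi-line string, and parse_with_regex is a hypothetical free-function version of Readline2 (not the original method), using the same pattern without the flags.

```python
import re

# Same pattern as in Readline2, compiled once.
PATTERN = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

def parse_with_regex(text):
    """Hypothetical standalone Readline2: split on blank lines, match each block."""
    result = {}
    for block in text.split("\n\n"):
        m = PATTERN.match(block)
        result[m.group(1)] = [m.group(i) for i in (2, 3, 4)]
    return result

clean = (
    "Interface 1/1\n"
    "    input : 1234\n"
    "    output : 3456\n"
    "    dropped : 12\n"
    "\n"
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "    dropped : 31\n"
)
# Same text, but the separating blank line now holds a single space.
dirty = clean.replace("\n\n", "\n \n")

print(parse_with_regex(clean))  # both entries
print(parse_with_regex(dirty))  # only the first entry; the second is silently lost
```

With the space in place the whole text stays one block, and re.match only reports the first entry it finds at the start of that block.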

Or, if you add an extra newline inside an entry (in this example, before the dropped line), like this:

Interface 1/2
    input : 7123
    output : 2345

    dropped : 31

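The \s* in the regex will handle the extra whitespace, but the non-regex parser assumes that lines[index+i].split(':') has an item at index [1], so it will raise an IndexError with this data.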
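A minimal sketch of that contrast, assuming the entry is a plain string; parse_with_splitlines is a hypothetical free-function version of ReadLine1, and the pattern is the one from Readline2.

```python
import re

PATTERN = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

def parse_with_splitlines(text):
    """Hypothetical standalone ReadLine1: index the three lines after each header."""
    result = {}
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            values = [lines[index + i].split(":")[1].strip() for i in (1, 2, 3)]
            result[line.split()[1]] = values
    return result

# Entry with an extra blank line before "dropped".
broken = (
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "\n"
    "    dropped : 31\n"
)

# The regex tolerates the blank line because \s* also matches newlines.
regex_values = PATTERN.match(broken).groups()

# The index-based parser hits the blank line as its third "value" line.
try:
    parse_with_splitlines(broken)
    outcome = "parsed"
except IndexError:
    outcome = "IndexError"

print(regex_values)
print(outcome)
```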

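Or, if you add some extra spaces at the end of a line (any line except dropped, since the pattern ends right after that number), the regex will not see the newline directly after the content, and re.match(n, block) will return None, so the next line will raise AttributeError: 'NoneType' object has no attribute 'group'.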
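One way to keep a single malformed block from aborting the whole parse is to check the match before using it. The sketch below is only an illustration under that assumption; parse_blocks_safely and its (parsed, skipped) return value are hypothetical and not part of the original code.

```python
import re

PATTERN = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

def parse_blocks_safely(text):
    """Hypothetical helper: like Readline2, but skips blocks that fail to match."""
    parsed, skipped = {}, []
    for block in text.split("\n\n"):
        match = PATTERN.match(block)
        if match is None:
            # Record the block instead of calling .group() on None.
            skipped.append(block)
            continue
        parsed[match.group(1)] = [match.group(i) for i in (2, 3, 4)]
    return parsed, skipped

text = (
    "Interface 1/1\n"
    "    input : 1234  \n"   # two trailing spaces break the match
    "    output : 3456\n"
    "    dropped : 12\n"
    "\n"
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "    dropped : 31\n"
)

parsed, skipped = parse_blocks_safely(text)
print(parsed)        # only the well-formed entry
print(len(skipped))  # number of blocks that did not match
```

Returning the skipped blocks makes the data loss visible to the caller rather than hiding it.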

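Or, if you change Interface in one of the entries to interface (no longer capitalized), the regex version will raise the same error as above, while the non-regex version will silently ignore that entry.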

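When I tested this, I found that the regex version was more easily broken by small edits to the sample message, but I also found that a version I made using a generator expression and str.partition is much more robust than either of them: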
def readline3(self):
    # Skip empty and whitespace-only lines up front.
    gen_lines = (line for line in self.output.splitlines()
                        if line and not line.isspace())
    try:
        while True:  # ends when next() raises StopIteration
            start, _, key = next(gen_lines).partition(" ")
            if start == "Interface":
                # The next three non-blank lines hold the counters.
                self.IFlist[key] = [next(gen_lines).rpartition(" : ")[2]
                                    for _ in "123"]
    except StopIteration:  # reached end of output
        return self.IFlist

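This succeeded in every case mentioned above, and since the only method it relies on is str.partition, which always returns a 3-item tuple, it will not raise any unexpected errors unless self.output is something other than a string.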
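For experimenting outside the class, the same idea can be sketched as a free function. The name parse_with_partition is hypothetical, and the .strip() calls are an addition that is not in the version above; they are there so trailing spaces do not end up in the keys or values.

```python
def parse_with_partition(text):
    """Hypothetical standalone readline3 built on str.partition."""
    result = {}
    # Drop empty and whitespace-only lines before parsing.
    lines = (line for line in text.splitlines() if line and not line.isspace())
    try:
        while True:  # ends when next() raises StopIteration
            start, _sep, key = next(lines).partition(" ")
            if start == "Interface":
                # .strip() is an addition here, not part of the original answer.
                result[key.strip()] = [
                    next(lines).rpartition(" : ")[2].strip() for _ in range(3)
                ]
    except StopIteration:  # reached the end of the input
        return result

messy = (
    "Interface 1/1\n"
    "    input : 1234  \n"   # trailing spaces
    "    output : 3456\n"
    "    dropped : 12\n"
    " \n"                    # "blank" line that holds a space
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "\n"                     # extra blank line inside the entry
    "    dropped : 31\n"
)

print(parse_with_partition(messy))
```

Note that if the input ends in the middle of an entry, StopIteration is raised inside the list comprehension and that partial entry is dropped rather than stored.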
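Also, when running benchmarks with timeit, your readline1 was consistently faster than readline2, and my readline3 was usually slightly faster than readline1 (though not in the sample run shown here):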
#using the default 1000000 loops using 'message'
<function readline1 at 0x100756f28>
11.225649802014232
<function readline2 at 0x1057e3950>
14.838601427007234
<function readline3 at 0x1057e39d8>
11.693351223017089
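The exact benchmark script is not shown, so the following is only a sketch of how such a comparison might be set up with timeit. The three functions are hypothetical free-function versions of the methods discussed, MESSAGE is assumed to be a plain multi-line string, and the loop count is reduced from the default 1000000 so the sketch runs quickly. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

MESSAGE = (
    "Interface 1/1\n    input : 1234\n    output : 3456\n    dropped : 12\n"
    "\n"
    "Interface 1/2\n    input : 7123\n    output : 2345\n    dropped : 31\n"
)

PATTERN = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

def readline1(text):
    """Index-based parsing, as in ReadLine1."""
    result = {}
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            result[line.split()[1]] = [
                lines[index + i].split(":")[1].strip() for i in (1, 2, 3)
            ]
    return result

def readline2(text):
    """Regex-based parsing, as in Readline2."""
    result = {}
    for block in text.split("\n\n"):
        m = PATTERN.match(block)
        result[m.group(1)] = [m.group(i) for i in (2, 3, 4)]
    return result

def readline3(text):
    """Generator plus str.partition parsing, as in readline3."""
    result = {}
    lines = (line for line in text.splitlines() if line and not line.isspace())
    try:
        while True:
            start, _sep, key = next(lines).partition(" ")
            if start == "Interface":
                result[key] = [next(lines).rpartition(" : ")[2] for _ in range(3)]
    except StopIteration:
        return result

def benchmark(number=2000):
    """Time each parser on MESSAGE; returns seconds per `number` calls."""
    return {
        func.__name__: timeit.timeit(lambda f=func: f(MESSAGE), number=number)
        for func in (readline1, readline2, readline3)
    }

timings = benchmark()
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```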

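I think this is more of a codereview.stackexchange.com question, since you are asking for a general review of your code rather than about a specific problem. In any case, I prefer the second option: it is a "higher-level" way of describing your problem. When reading your code, it is easier to see what the regex is doing than to follow the splitting and indexing. The list comprehension is also more Pythonic.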
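If the regex route is preferred for readability, one possible variation (not from the question or the answer) is to scan the whole text with re.finditer instead of splitting on blank lines, and to allow optional spaces and tabs around each field. The pattern, the group names and parse_entries below are all illustrative assumptions.

```python
import re

# Verbose pattern: whitespace inside the pattern is ignored except in [ \t].
ENTRY = re.compile(
    r"""
    ^Interface[ \t]+(?P<name>\S+)[ \t]*\n                 # header line
    \s*input[ \t]*:[ \t]*(?P<input>\d+)[ \t]*\n           # input counter
    \s*output[ \t]*:[ \t]*(?P<output>\d+)[ \t]*\n         # output counter
    \s*dropped[ \t]*:[ \t]*(?P<dropped>\d+)               # dropped counter
    """,
    re.MULTILINE | re.VERBOSE,
)

def parse_entries(text):
    """Hypothetical parser: one dict entry per match found anywhere in text."""
    return {
        m["name"]: [m["input"], m["output"], m["dropped"]]
        for m in ENTRY.finditer(text)
    }

messy = (
    "Interface 1/1\n"
    "    input : 1234  \n"   # trailing spaces
    "    output : 3456\n"
    "    dropped : 12\n"
    " \n"                    # separator line that holds a space
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "\n"                     # extra blank line inside the entry
    "    dropped : 31\n"
)

print(parse_entries(messy))
```

Because finditer simply moves on to the next match, an entry that does not fit the pattern is skipped instead of raising AttributeError, which may or may not be the behavior you want.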