Parsing with regular expressions in Python

I am trying to count the number of lines contained in each section of a file that looks like this:

-StartACheck
---Lines--
-EndACheck
-StartBCheck
---Lines--
-EndBCheck

This is the code I wrote for it:

count=0
z={}
for line in file:
      s=re.search(r'\-+Start([A-Za-z0-9]+)Check',line)
      if s:
           e=s.group(1)
           for line in file:
               z.setdefault(e,[]).append(count)
               q=re.search(r'\-+End',line)
               if q:
                   count=0
                   break

for a,b in z.items():
    print(a,len(b))

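I basically want to store in a dictionary the number of lines that occur inside ACheck, BCheck, and so on, but I keep getting the wrong output.

Something like this: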

A,15
B,9

You could consider using something like this:

import re
from collections import defaultdict

counts = defaultdict(int)  # missing keys start at zero
curr_key = None  # name of the block that is currently open, if any

for line in file:
    stripped = line.rstrip('\n')
    # fullmatch returns None when the line is not a delimiter
    start = re.fullmatch(r'-+Start([AB])Check', stripped)
    end = re.fullmatch(r'-+End([AB])Check', stripped)
    if start:
        curr_key = start.group(1)
    elif end:
        assert curr_key == end.group(1), "ending line {!r} doesn't match the opening line for {}".format(line, curr_key)
        curr_key = None
    else:  # it's a normal line
        counts[curr_key] += 1

Bonus: this detects mismatched start/end lines, and it also counts the lines that fall outside any start/end block.

Without defaultdict, replace the else clause with:

    else:  # it's a normal line
        if curr_key in counts:
            counts[curr_key] += 1
        else:
            counts[curr_key] = 1

and define counts as a regular dict:

counts = {}
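
Putting the pieces together, here is a minimal self-contained sketch of the line-by-line approach that avoids both defaultdict and fullmatch. The function name count_block_lines and the SAMPLE data are made up for illustration, and lines outside any block are simply ignored here, which is a simplification compared with the code above.

```python
import re

# Hypothetical helper patterns: delimiter lines such as -StartACheck / -EndACheck.
START = re.compile(r'^-+Start([A-Za-z0-9]+)Check\s*$')
END = re.compile(r'^-+End([A-Za-z0-9]+)Check\s*$')

def count_block_lines(lines):
    counts = {}       # plain dict instead of defaultdict
    curr_key = None   # name of the block currently open, if any
    for line in lines:
        start = START.match(line)
        end = END.match(line)
        if start:
            curr_key = start.group(1)
            counts.setdefault(curr_key, 0)
        elif end:
            if end.group(1) != curr_key:
                raise ValueError("unexpected end line: {!r}".format(line))
            curr_key = None
        elif curr_key is not None:  # a normal line inside a block
            counts[curr_key] += 1
    return counts

SAMPLE = [
    '-StartACheck\n', 'a\n', 'b\n', 'c\n', '-EndACheck\n',
    '-StartBCheck\n', 'a\n', 'b\n', '-EndBCheck\n',
]
print(count_block_lines(SAMPLE))  # {'A': 3, 'B': 2}
```

The function accepts any iterable of lines, so an open file object can be passed in directly.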

Fixing the given code

The given code seems to work. Here is an (apparently valid) definition of the file:

FILE = iter((  # generator of lines
    '-StartACheck',
    'a',
    'b',
    'c',
    '-EndACheck',
    '-StartBCheck',
    'a',
    'b',
    '-EndBCheck',
))

Here are the missing definitions:

import re
z = {}

And the provided code:

count=0
for line in FILE:
      s=re.search(r'\-+Start([A-Za-z0-9]+)Check',line)
      if s:
           e=s.group(1)
           for line in FILE:
               z.setdefault(e,[]).append(count)
               q=re.search(r'\-+End',line)
               if q:
                   count=0
                   break

for a,b in z.items():
    print(a,len(b))

The output is:

A 4
B 3

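Each count is one more than the number of content lines in the block, because the closing line (-EndACheck) is counted as well: the append runs before the End check.

      if s:
           e=s.group(1)
           for line in FILE:
               z.setdefault(e,[]).append(count)  # also runs for the End line, before the break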
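
If only the lines between the delimiters should be counted, one option is to test for the End line before recording anything. The sketch below keeps the structure of the original loop. The name count_inner_lines is made up here, and the function wraps its input in iter() so that both loops share a single iterator.

```python
import re

def count_inner_lines(lines):
    """Count the lines strictly between each Start/End pair."""
    lines = iter(lines)  # make sure both loops consume the same iterator
    z = {}
    for line in lines:
        s = re.search(r'-+Start([A-Za-z0-9]+)Check', line)
        if s:
            e = s.group(1)
            z.setdefault(e, 0)
            for line in lines:  # continues right after the Start line
                if re.search(r'-+End', line):
                    break       # stop before the End line is counted
                z[e] += 1
    return z

FILE = ('-StartACheck', 'a', 'b', 'c', '-EndACheck',
        '-StartBCheck', 'a', 'b', '-EndBCheck')
for name, n in count_inner_lines(FILE).items():
    print(name, n)
```

For the sample above this prints A 3 and B 2.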

The error may lie in how the lines are extracted from the file. If the file is read as:

file = tuple(open('filename.ext'))

then the nested for loops in the source code iterate over every line of the file for each line of the file. For example:

filelines = (1, 2, 3, 4)
for line in filelines:
    for line in filelines:
        print(line)

Compare this with the nearly identical version below, which works in this case:

filelines = iter((1, 2, 3, 4))
for line in filelines:
    for line in filelines:
        print(line)
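
To make the difference concrete, the following sketch collects what each variant visits instead of printing it. The helper name nested_visit is made up for illustration.

```python
def nested_visit(lines):
    """Return the items seen by the inner loop of a nested for over `lines`."""
    seen = []
    for line in lines:
        for line in lines:
            seen.append(line)
    return seen

# A tuple restarts from the beginning for every outer iteration: 4 x 4 items.
print(nested_visit((1, 2, 3, 4)))

# An iterator is shared: the outer loop consumes 1, the inner loop drains the rest.
print(nested_visit(iter((1, 2, 3, 4))))  # [2, 3, 4]
```

With a tuple (or list) of lines, every Start line makes the inner loop re-scan the file from the top up to the first End line, which inflates the counts.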

If the file isn't too large (say, less than 1 GB or so), I would just read the whole thing and call re.findall().
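
As a rough sketch of that idea (not necessarily the answerer's original code), the whole text can be searched for Start/End blocks and the lines in each block counted. The function name count_with_findall, the BLOCK pattern, and the TEXT sample are assumptions made for illustration.

```python
import re

# One match per block: group 1 is the block name, group 2 is the block body.
BLOCK = re.compile(r'^-Start([A-Za-z0-9]+)Check\n(.*?)^-End\1Check$', re.S | re.M)

def count_with_findall(text):
    counts = {}
    # findall returns one (name, body) tuple per block because of the two groups
    for name, body in BLOCK.findall(text):
        counts[name] = counts.get(name, 0) + len(body.splitlines())
    return counts

TEXT = """-StartACheck
a
b
c
-EndACheck
-StartBCheck
a
b
-EndBCheck
"""
print(count_with_findall(TEXT))  # {'A': 3, 'B': 2}
```

For a real file, the text would come from something like open(path).read(), where path is whatever file is being parsed.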

Given:

-StartACheck
--- Line 1
-EndACheck
-StartBCheck
---Line 1
-EndBCheck
-StartACheck
---Line 1
---Line 2
---Line 3
-EndACheck

You can use a multiline regex to capture the blocks that start with -Start[pattern]Check and end with -End[pattern]Check, like so:

^-Start([A-Za-z0-9]+)Check$(.*?)^-End(?:\1)Check$

In Python, you can combine that with re.finditer and a Counter, like so:

import re
from collections import Counter
pattern=r'^-Start([A-Za-z0-9]+)Check$(.*?)^-End(?:\1)Check$'

c=Counter()
with open(fn, "r") as f:
    for m in re.finditer(pattern, f.read(), re.S | re.M):
        c+=Counter({m.group(1): len(m.group(2).splitlines())-1})

Prints:

Counter({'A': 4, 'B': 1})
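
The -1 in that snippet is there because group 2 starts with the newline that follows the Start line, so splitlines() yields one empty leading entry. A quick way to see this, using a made-up in-memory sample instead of a file:

```python
import re

pattern = r'^-Start([A-Za-z0-9]+)Check$(.*?)^-End(?:\1)Check$'
text = "-StartACheck\n--- Line 1\n-EndACheck\n"

m = re.search(pattern, text, re.S | re.M)
body_lines = m.group(2).splitlines()
print(repr(m.group(2)))     # the captured body, including the leading newline
print(body_lines)           # first entry is the empty string before that newline
print(len(body_lines) - 1)  # number of real lines in the block
```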

If you would rather not read the entire file into memory, use an mmap of the file, like so:

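import re
from collections import Counter
import mmap
# In Python 3 an mmap is bytes-like, so the pattern must be a bytes pattern too
pattern=rb'^-Start([A-Za-z0-9]+)Check$(.*?)^-End(?:\1)Check$'

c=Counter()
with open(fn, "rb") as f:
    mm=mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # read-only mapping
    for m in re.finditer(pattern, mm, re.S | re.M):
        # groups are bytes here; decode the block name for use as a key
        c+=Counter({m.group(1).decode(): len(m.group(2).splitlines())-1})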

The OS will then take care of reading the file in appropriately sized chunks while the regex is matched.
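
Where is count defined?

I just defined count as 0 at the top.

Just use a boolean variable to start or stop counting when a Start or End line is encountered.

If I don't use a regex, I can't extract the "A" and "B".

Do you reset count after reaching the end of a group?

I can't use defaultdict. Is there anything you can do without defaultdict?

defaultdict is in the stdlib, and you should use it wherever you need it. Still, I added a patch to the code so that it works with counts defined as a dict.

I can't use fullmatch either. In any case, could you look at my implementation and tell me what is wrong with it?

Thanks a lot, but for some reason it doesn't work.

Maybe I'm missing something. Could it be the way the file is read, as shown at the end of my answer?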