Python multiline regex matching: retrieving line numbers and matches

python python-3.x regex

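I'm trying to iterate over all the lines in a file to match a pattern that may:

occur anywhere in the file

occur multiple times in the same file

occur multiple times on the same line

be spread over multiple lines (the string I'm searching for can span several lines for a single regex pattern)

An example input is: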

new File()
new
File()
there is a new File()
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line

The expected output is:

new File() Found on line 1  
new File() Found on lines 2 & 3 
new File() Found on line 4 
new File() Found on lines 5 & 9 
new File() Found on line 11
new File() Found on line 11 
6 occurrences of new File() pattern in test.txt (Filename)

The regex pattern looks like:

pattern = r'new\s+File\s*\({1}\s*\){1}'

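Looking at the documentation, I can see that match, findall and finditer all return matches at the beginning of the string, but I can't see a way to use the search function, which looks for the regex anywhere, when the string we're searching for spans multiple lines (the fourth of my requirements above).

It is simple enough to match multiple occurrences of the regex on a single line.

Example input: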

line = "new File() new File()"

Code:
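A minimal sketch of the single-line case, assuming `re.findall` with the pattern above (an illustration, not necessarily the asker's original code):

```python
import re

# Pattern from the question; the {1} quantifiers are redundant but harmless.
pattern = r'new\s+File\s*\({1}\s*\){1}'
line = "new File() new File()"

# findall scans the whole string and returns every non-overlapping match.
hits = re.findall(pattern, line)
print(len(hits), hits)
```

This handles repeated matches on one line, but running it line by line cannot see a match that starts on one line and ends on another.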

Is there a way to do what I want using Python's regular expressions?

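You can first find all \n characters in the text and their respective positions (character indices). Since each \n starts a new line, the index of each value in this list indicates the number of the line that the \n character terminates. Then search for all occurrences of the pattern and use the list above to look up the lines on which each match starts and ends.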

import re
import bisect

text = """new 
File()
aa new File()
new
File()
there is a new File() and new
File() again
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line
"""

# character indices of all \n characters in text
nl = [m.start() for m in re.finditer("\n", text, re.MULTILINE|re.DOTALL)]

matches = list(re.finditer(r"(new\s+File\(\))", text, re.MULTILINE|re.DOTALL))
match_count = 0
for m in matches:
    match_count += 1
    # 0-based numbers of the first through last line touched by the match
    r = range(bisect.bisect(nl, m.start()-1), bisect.bisect(nl, m.end()-1)+1)
    # collapse whitespace runs; re.sub's fourth positional argument is count,
    # so no flag is passed positionally here
    print(re.sub(r"\s+", " ", m.group(1)), "found on line(s)", *r)
print(f"{match_count} occurrences of new File() found in file....")

Output:

new File() found on line(s) 0 1
new File() found on line(s) 2
new File() found on line(s) 3 4
new File() found on line(s) 5
new File() found on line(s) 5 6
new File() found on line(s) 7 8 9 10 11
new File() found on line(s) 13
new File() found on line(s) 13
8 occurrences of new File() found in file....
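The output above numbers lines from 0, while the question asks for 1-based numbers. A hedged variant of the same newline-index idea is sketched below; `find_with_lines` and `sample` are hypothetical names introduced only for this illustration.

```python
import bisect
import re

def find_with_lines(pattern, text):
    """Yield (normalized match, first line, last line) with 1-based line numbers."""
    # Offsets of every newline; the entry at index i terminates line i + 1.
    newlines = [i for i, ch in enumerate(text) if ch == "\n"]
    for m in re.finditer(pattern, text):
        # Number of newlines strictly before an offset, plus one, is its line number.
        first = bisect.bisect_left(newlines, m.start()) + 1
        # m.end() - 1 is the offset of the last matched character.
        last = bisect.bisect_left(newlines, m.end() - 1) + 1
        yield re.sub(r"\s+", " ", m.group()), first, last

sample = "new File()\nnew\nFile()\nthere is a new File()\n"
results = list(find_with_lines(r"new\s+File\(\)", sample))
for found, first, last in results:
    where = f"line {first}" if first == last else f"lines {first} & {last}"
    print(f"{found} Found on {where}")
```

Building the newline list once and using binary search keeps each lookup logarithmic in the number of lines, which matters mainly for large files with many matches.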

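You can count the newlines before the match, then count the newlines inside the matched text, and combine the line numbers:

import re

s = 'new File()\nnew\nFile()\nthere is a new File()\nnew\n    \n    \n    \nFile()\nthere is not a matching pattern here File() new\nnew File() test new File() occurs twice on this line'
pattern = r'new\s+File\s*\(\s*\)'
for m in re.finditer(pattern, s):
    # 1-based number of the line on which the match starts
    linenums = [s[:m.start()].count('\n') + 1]
    # every newline inside the match adds one more line
    for _ in range(m.group().count('\n')):
        linenums.append(linenums[-1] + 1)
    print('{} found on line(s) {}'.format(re.sub(r'\s+', ' ', m.group()), ', '.join(map(str, linenums))))

Output:

new File() found on line(s) 1
new File() found on line(s) 2, 3
new File() found on line(s) 4
new File() found on line(s) 5, 6, 7, 8, 9
new File() found on line(s) 11
new File() found on line(s) 11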
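To get closer to the report format requested in the question, including the final count and file name, the newline-counting approach could be wrapped in a function that reads a file. This is only a sketch: `report`, `PATTERN` and the temporary `test.txt` are assumptions made for the illustration, and the whole file is read into memory so that matches can cross line boundaries.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"new\s+File\s*\(\s*\)")

def report(path):
    """Return one summary line per match plus a final count for the file at path."""
    text = Path(path).read_text()
    lines_out = []
    for m in PATTERN.finditer(text):
        # Newlines before the match give the 1-based starting line.
        first = text.count("\n", 0, m.start()) + 1
        # Newlines inside the match give how many further lines it spans.
        last = first + m.group().count("\n")
        where = f"line {first}" if first == last else f"lines {first} & {last}"
        lines_out.append(f"new File() Found on {where}")
    total = len(lines_out)
    lines_out.append(f"{total} occurrences of new File() pattern in {Path(path).name}")
    return lines_out

# Demo with a temporary stand-in for the question's test.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "test.txt"
    demo.write_text("new File()\nnew\nFile()\nnew File() test new File()\n")
    summary = report(demo)
print("\n".join(summary))
```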

@mrxa This is great, thank you.

Note that re.MULTILINE | re.DOTALL is redundant here, because the pattern contains no ^, $ or . whose behavior these options would modify.
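A quick way to check the point about the flags is to compare the match spans with and without them. The sample `text` below is made up for the illustration; `\s` already matches newlines, so the flags should make no difference for this pattern.

```python
import re

text = "new\nFile() and new File()"
pattern = r"new\s+File\(\)"

# Spans found with default flags.
plain = [m.span() for m in re.finditer(pattern, text)]
# Spans found with the flags used in the answer above.
flagged = [m.span() for m in re.finditer(pattern, text, re.MULTILINE | re.DOTALL)]

print(plain)
print(plain == flagged)
```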