What is the best way to remove repeated lines that match a regex from a string using Python?
This is a fairly straightforward attempt. I haven't used Python in a long time. It seems to work, but I'm sure I still have a lot to learn. Can someone tell me how far off I am? It needs to look for a pattern, write the first line that matches, then add a summary message for the remaining consecutive lines that match the pattern, and return the modified string.
Just to clarify, the regex
.*Dog.*
would take
Cat
Dog
My Dog
Her Dog
Mouse
and return
Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse
#!/usr/bin/env python
#
import re

def remove_repeats(l_string, l_regex):
    """Take a string, remove similar lines and replace with a summary message.

    l_regex accepts strings and tuples.
    """
    # Convert string to tuple.
    # (isinstance(..., str) stands in for the Python 2-only types.StringType.)
    if isinstance(l_regex, str):
        l_regex = l_regex,
    for t in l_regex:
        r = ''
        p = ''
        m = 0  # Match counter; must exist before the first 'm += 1'.
        for l in l_string.splitlines(True):
            if l.startswith('::::: Pattern'):
                r = r + l
            else:
                if re.search(t, l):  # If line matches regex.
                    m += 1
                    if m == 1:  # If this is first match in a set of lines add line to file.
                        r = r + l
                    elif m > 1:  # Else update the message string.
                        p = "::::: Pattern '" + t + "' repeats " + str(m-1) + ' more times.\n'
                else:
                    if p:  # Write the message string if it has value.
                        r = r + p
                        p = ''
                    m = 0
                    r = r + l
        if p:  # Write the message if loop ended in a pattern.
            r = r + p
            p = ''
        l_string = r  # Reset string to modified string.
    return l_string
The rematcher function below seems to do what you want:
import re
import sys

def rematcher(re_str, iterable):
    matcher = re.compile(re_str)
    in_match = 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match-1)
            in_match = 0
            yield item
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match-1)

for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)
EDIT
In your case, the final string should be:
final_string= '\n'.join(rematcher(".*Dog.*", your_initial_string.split("\n")))
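To see the generator working on a whole string, here is a minimal sketch. It repeats rematcher so the snippet runs on its own, and the input text is just the example from the question. It assumes you split with splitlines(True), which keeps each line's newline, so that ''.join puts the string back together without an extra newline after the summary message (the message already ends in "\n").

```python
import re

def rematcher(re_str, iterable):
    """Same generator as in the answer, repeated so this runs standalone."""
    matcher = re.compile(re_str)
    in_match = 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item  # first line of a run is passed through
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match - 1)
            in_match = 0
            yield item
    if in_match > 1:
        # The input ended while still inside a run of matches.
        yield "%s repeats %d more times\n" % (re_str, in_match - 1)

# Example input from the question.
text = "Cat\nDog\nMy Dog\nHer Dog\nMouse\n"

# Keep the newlines while splitting, then join with the empty string.
result = "".join(rematcher(".*Dog.*", text.splitlines(True)))
print(result)
```

With split("\n") and '\n'.join instead, the same approach works, but the summary line is followed by a blank line unless the "\n" is dropped from the message.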
I updated your code to make it a bit more effective:
#!/usr/bin/env python
#
import re

def remove_repeats(l_string, l_regex):
    """Take a string, remove similar lines and replace with a summary message.

    l_regex accepts strings/patterns or tuples of strings/patterns.
    """
    # Convert string/pattern to tuple.
    # (In Python 3 a str has __iter__, so test for tuple/list instead.)
    if not isinstance(l_regex, (tuple, list)):
        l_regex = l_regex,
    ret = []
    last_regex = None
    count = 0
    for line in l_string.splitlines(True):
        if last_regex:
            # Previous line matched one of the regexes
            if re.match(last_regex, line):
                # This one does too
                count += 1
                continue  # skip to next line
            elif count > 1:
                ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
            count = 0
            last_regex = None
        ret.append(line)
        # Look for other patterns that could match
        for regex in l_regex:
            if re.match(regex, line):
                # Found one
                last_regex = regex
                count = 1
                break  # exit inner loop
    # Write the summary if the string ended inside a run of matches
    if last_regex and count > 1:
        ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
    return ''.join(ret)
First, your regular expression will match more slowly than it would if you left off the greedy matches.
.*Dog.*
is equivalent, when used with re.search, to
Dog
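but the latter matches more quickly because no backtracking is involved. The longer the string, the more likely it is that "Dog" appears several times, and the more backtracking work the regex engine has to do. As it stands, ".*D" virtually guarantees backtracking.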
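If you want to check the speed claim on your own data, a rough sketch with timeit could look like the following. The test line is made up for illustration, and the timings will vary by machine and Python version, so no particular numbers are promised here. Note that the two patterns are only interchangeable with search(); with match(), the bare "Dog" only matches at the start of the line.

```python
import re
import timeit

# Hypothetical test line: "Dog" near the start, followed by a long tail.
line = "Dog " + "x" * 10000

greedy = re.compile(r".*Dog.*")
plain = re.compile(r"Dog")

# With search(), both patterns agree on whether a line contains "Dog".
assert bool(greedy.search(line)) == bool(plain.search(line))
assert bool(greedy.search("Cat")) == bool(plain.search("Cat"))

# With match(), the plain pattern is anchored at the start of the string.
assert plain.match("My Dog") is None
assert greedy.match("My Dog") is not None

# Time both searches on the same line.
t_greedy = timeit.timeit(lambda: greedy.search(line), number=1000)
t_plain = timeit.timeit(lambda: plain.search(line), number=1000)
print("greedy: %.4fs  plain: %.4fs" % (t_greedy, t_plain))
```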
That said, how about:
#! /usr/bin/env python
import re          # regular expressions
import fileinput   # read from STDIN or file

my_regex = '.*Dog.*'
my_matches = 0

for line in fileinput.input():
    line = line.strip()
    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)
        my_matches = my_matches + 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0
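As an alternative sketch (not taken from any of the answers), itertools.groupby can split the lines into runs of consecutive matches and non-matches. The function name collapse_repeats is made up for this example. It assumes that a single matching line should be left alone with no summary, and that a run at the very end of the input should still get its summary line.

```python
import re
from itertools import groupby

def collapse_repeats(text, pattern):
    """Keep the first line of each run of consecutive matching lines and
    replace the rest of the run with one summary line."""
    regex = re.compile(pattern)
    out = []
    lines = text.splitlines(True)  # keep the newlines
    # groupby yields runs of adjacent lines that all match or all don't.
    for matched, run in groupby(lines, key=lambda line: bool(regex.search(line))):
        run = list(run)
        if not matched:
            out.extend(run)
            continue
        out.append(run[0])
        if len(run) > 1:  # a lone match gets no summary
            out.append("::::: Pattern %s repeats %d more times.\n"
                       % (pattern, len(run) - 1))
    return "".join(out)

print(collapse_repeats("Cat\nDog\nMy Dog\nHer Dog\nMouse\nDoggy\nHula\n", ".*Dog.*"))
```

Because each run is handled as a unit, there is no counter to reset and no separate flush after the loop.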
It's not clear what should happen with non-adjacent matches.
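It's also not clear what should happen when a single matching line is surrounded by non-matching lines. Add "Doggy" and "Hula" to the input file and you get the message that the pattern repeats "0" more times.
I'll have to read up a bit to understand how yield is used. Thanks.
yield is "return and keep state". OK, forget that. Say you ask me to start reciting the powers of two, which you will use in some calculation of your own. I start with "1", and you do your thing. Then you ask me, "next?" and I say "2". And so on. Every time you ask "next", I yield a value.
Thanks. Non-adjacent matches are not counted. Single-line matches are not counted.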
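The powers-of-two description of yield in the comments can be written out directly. This is only an illustrative sketch, and the name powers_of_two is made up here. Each call to next() is the "next?" question, and the generator picks up right where it paused.

```python
def powers_of_two():
    """Yield 1, 2, 4, 8, ... one value per request."""
    value = 1
    while True:
        yield value   # hand back a value and pause here
        value *= 2    # execution resumes here on the next request

gen = powers_of_two()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 4
```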