What is the best way to remove repeated lines matching a regex from a string using Python?


python regex


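This is a fairly straightforward attempt. I haven't used Python in a long time. It seems to work, but I'm sure I still have a lot to learn. Could someone tell me how far off I am? The function needs to look for a pattern, write the first line that matches, then add a summary message for the remaining consecutive lines that match the pattern, and return the modified string.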

Just to clarify, the regex

.*Dog.*

would need to take

Cat
Dog
My Dog
Her Dog
Mouse

and return

Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse


#!/usr/bin/env python
#

import re

def remove_repeats (l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings and tuples.
   """

   # Convert a single pattern string to a one-element tuple.
   if isinstance(l_regex, str):
      l_regex = l_regex,


   for t in l_regex:
      r = ''
      p = ''
      m = 0  # Consecutive matching lines seen so far; must exist before "m += 1".
      for l in l_string.splitlines(True):
         if l.startswith('::::: Pattern'):
            r = r + l
         else:
            if re.search(t, l): # If line matches regex.
                m += 1
                if m == 1: # If this is first match in a set of lines add line to file.
                   r = r + l
                elif m > 1: # Else update the message string.
                   p = "::::: Pattern '" + t + "' repeats " + str(m-1) +  ' more times.\n'
            else:
                if p: # Write the message string if it has value.
                   r = r + p
                   p = ''
                m = 0
                r = r + l

      if p: # Write the message if loop ended in a pattern.
          r = r + p
          p = ''

      l_string = r # Reset string to modified string.

   return l_string

The rematcher function below seems to do what you want:

def rematcher(re_str, iterable):

    matcher= re.compile(re_str)
    in_match= 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item
            in_match+= 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match-1)
            in_match= 0
            yield item
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match-1)

import sys, re

for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)

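EDIT: in your case, the final string would be:

# The summary line already ends in "\n", so keep the line endings and join with ''.
final_string= ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))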
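As a quick check, here is a minimal sketch that runs the generator on the sample input from the question. The generator is repeated so the snippet runs on its own; the names `sample` and `result` are only illustrative.

```python
import re


def rematcher(re_str, iterable):
    # Same generator as in the answer, repeated so this snippet is self-contained.
    matcher = re.compile(re_str)
    in_match = 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item          # first line of a run is passed through
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match - 1)
            in_match = 0
            yield item
    if in_match > 1:                # input ended inside a run of matches
        yield "%s repeats %d more times\n" % (re_str, in_match - 1)


sample = "Cat\nDog\nMy Dog\nHer Dog\nMouse\n"
# splitlines(True) keeps each line's trailing newline, so ''.join rebuilds the text.
result = "".join(rematcher(".*Dog.*", sample.splitlines(True)))
print(result)
```

Keeping the line endings avoids the blank line that would otherwise follow each summary message, since the message itself already ends in a newline.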

Here is an updated version of your code that should be somewhat more efficient:

#!/usr/bin/env python
#

import re

def remove_repeats (l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings/patterns or tuples of strings/patterns.
   """

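   # Convert a single string/pattern to a one-element tuple.
   # (Checking for '__iter__' would treat a str as a sequence on Python 3.)
   if not isinstance(l_regex, (tuple, list)):
      l_regex = l_regex,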

   ret = []
   last_regex = None
   count = 0

   for line in l_string.splitlines(True):
      if last_regex:
         # Previous line matched one of the regexes
         if re.match(last_regex, line):
            # This one does too
            count += 1
            continue  # skip to next line
         elif count > 1:
            ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
         count = 0
         last_regex = None

      ret.append(line)

      # Look for other patterns that could match
      for regex in l_regex:
         if re.match(regex, line):
            # Found one
            last_regex = regex
            count = 1
            break  # exit inner loop

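   # Flush the summary if the input ended inside a run of matches.
   if last_regex and count > 1:
      ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))

   return ''.join(ret)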
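For comparison, the same run-collapsing idea can be sketched with itertools.groupby. This is an alternative formulation, not the code from the answer above; the function name `summarize_runs` and the helper `first_matching` are made up for illustration, and it assumes re.search semantics.

```python
import re
from itertools import groupby


def summarize_runs(text, patterns):
    """Collapse each run of consecutive lines matching the same pattern."""
    compiled = [re.compile(p) for p in patterns]

    def first_matching(line):
        # Return the text of the first pattern that matches, or None.
        for rx in compiled:
            if rx.search(line):
                return rx.pattern
        return None

    out = []
    # groupby puts consecutive lines with the same key into one group.
    for pattern, run in groupby(text.splitlines(True), key=first_matching):
        run = list(run)
        if pattern is None:
            out.extend(run)        # non-matching lines pass through unchanged
            continue
        out.append(run[0])         # keep the first line of the run
        if len(run) > 1:           # a lone match gets no summary line
            out.append("::::: Pattern %r repeats %d more times.\n"
                       % (pattern, len(run) - 1))
    return "".join(out)


print(summarize_runs("Cat\nDog\nMy Dog\nHer Dog\nMouse\n", [".*Dog.*"]))
```

Because each group is a complete run, there is no counter to reset and no separate flush step when the input ends inside a run.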

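First off, your regular expression will match more slowly than it would without the greedy wildcards.

.*Dog.*

is equivalent to

Dog

when used with re.search (re.match anchors at the start of the string, so a bare Dog would only match lines that begin with "Dog"). The latter matches faster because no backtracking is involved. The longer the line, the more likely "Dog" appears multiple times, and the more backtracking work the regex engine has to do. As it is, ".*D" virtually guarantees backtracking.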
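A small sketch to check the claim above: it confirms that both patterns give the same verdict under search(), shows where match() differs, and prints a rough timing. The sample lines are invented, and the timings depend on the machine and Python version, so no numbers are quoted here.

```python
import re
import timeit

greedy = re.compile(r".*Dog.*")
plain = re.compile(r"Dog")

lines = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse", "x" * 200 + "Dog" + "y" * 200]

# With search(), both patterns agree on which lines match.
agree = all(bool(greedy.search(s)) == bool(plain.search(s)) for s in lines)
print("same verdict on every line:", agree)

# With match(), the bare pattern is anchored at the start of the line.
print(bool(plain.match("My Dog")), bool(greedy.match("My Dog")))  # False True

# Rough timing on one long line; results vary from run to run.
long_line = "x" * 2000 + "Dog" + "y" * 2000
for name, rx in (("greedy", greedy), ("plain", plain)):
    print(name, timeit.timeit(lambda: rx.search(long_line), number=2000))
```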

That said, how about:

#! /usr/bin/env python

import re            # regular expressions
import fileinput    # read from STDIN or file

my_regex = '.*Dog.*'
my_matches = 0

for line in fileinput.input():
    line = line.strip()

    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)
        my_matches = my_matches + 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0

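It's not clear what should happen with non-adjacent matches.

It's also not clear what should happen when a single matching line is surrounded by non-matching lines. Add "Doggy" and "Hula" to the input file and you'll get a "repeats 0 more times" message.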
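One way to handle both cases, sketched under the assumption that a lone match should produce no summary and that each run is counted separately: emit the message only when a run has more than one line, and flush once more when the input ends. The `summarize` helper and the sample list are hypothetical; they are not part of the answer above.

```python
import re


def summarize(lines, pattern):
    """Yield lines, collapsing each run of consecutive matches (hypothetical helper)."""
    matches = 0
    message = '::::: Pattern %s repeats %i more times.'
    for line in lines:
        line = line.rstrip('\n')
        if re.search(pattern, line):
            if matches == 0:
                yield line       # first line of a run is kept
            matches += 1
        else:
            if matches > 1:      # a lone match produces no summary
                yield message % (pattern, matches - 1)
            matches = 0
            yield line
    if matches > 1:              # flush when the input ends inside a run
        yield message % (pattern, matches - 1)


sample = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse", "Doggy", "Hula", "Dog", "Dog"]
for out in summarize(sample, "Dog"):
    print(out)
```

With this guard, "Doggy" between "Mouse" and "Hula" passes through without a "0 more times" line, and the trailing pair of "Dog" lines is still summarized.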

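I'll have to read up on how to use yield. Thanks.

yield is "return, keeping state". OK, forget that. Say you have me start reciting the powers of two, which you will use in some calculation of your own. I start with "1", and you do your thing. Then you ask me, "next?", and I say "2". And so on. Every time you ask for "next", I give you a value.

Thanks.

Non-adjacent matches are not counted.

Single-line matches are not counted.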
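The "powers of two" analogy from the comments can be written as a generator. This is only an illustrative sketch; the name `powers_of_two` is made up here.

```python
def powers_of_two():
    """Yield 1, 2, 4, 8, ... one value per request."""
    value = 1
    while True:
        yield value      # hand back a value and pause here
        value *= 2       # resume from this point on the next request


gen = powers_of_two()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 4
```

Each call to next() corresponds to asking "next?": the function resumes right after the yield with its local state intact.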