Python: How do I delete one of two duplicate blocks in a file?

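I have a tricky problem, and I know there are plenty of Python experts here, so please help me out. I have a very large log file. Its format looks like this: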

[text hello world yadda

          lines lines lines

          exceptions]

[something i'm not interested in]

[text hello world yadda

          lines lines lines

          exceptions]
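And so on...

So block 1 and block 3 are identical, and there are many cases like this. My question is: how do I read this file and write only the unique blocks to an output file? If a block has a duplicate, it should be written only once. Sometimes there are several other blocks between two duplicates. I'm actually doing pattern matching, and this is my code so far. It only matches the pattern and does nothing about duplicates: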

import re
import sys

try:
    if len(sys.argv) != 3:
        sys.exit("You should enter 3 parameters.")
    elif sys.argv[1] == sys.argv[2]:
        sys.exit("The two file names cannot be the same.")

    # Note: [java|javax|org|com] is a character class, not an alternation,
    # so this matches any single one of those characters followed by '.', '|' or ':'.
    java_regex = re.compile(r'[java|javax|org|com]+?[\.|:]+?', re.I)  # java
    at_regex = re.compile(r'at\s', re.I)  # at
    skip_regex = re.compile(
        r'mdcloginid:|webcontainer|c\.h\.i\.h\.p\.u\.e|threadPoolTaskExecutor|caused\sby',
        re.I)

    copy = False  # flag that controls whether lines are copied to the output

    with open(sys.argv[1], "r") as infile, open(sys.argv[2], "w") as outfile:
        for line in infile:
            if java_regex.search(line) and not (at_regex.search(line) or skip_regex.search(line)):
                # start copying if "java" is in the input
                copy = True
            elif copy and not at_regex.search(line):
                # stop copying if "at" is not in the input
                copy = False

            if copy:
                outfile.write(line)

except IOError:
    sys.exit("IO error or wrong file name.")
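I don't mind whether the duplicate removal happens in this code or in a separate .py file. Either is fine. Here is an original snippet from the log file: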

javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server     
    at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invoke(JAXWSProxyHandler.java:188) ~[org.apache.axis2.jar:na]
com.hcentive.utils.exception.HCRuntimeException: Unable to Find User Profile:null
    at com.hcentive.agent.service.AgentServiceImpl.getAgentByUserProfile(AgentServiceImpl.java:275) ~[agent-service-core-4.0.0.jar:na]
    at com.hcentive.agent.service.AgentServiceImpl$$FastClassByCGLIB$$e3caddab.invoke(<generated>) ~[cglib-2.2.jar:na]
    at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191) ~[cglib-2.2.jar:na]
    at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) ~[spring-tx-3.1.2.RELEASE.jar:3.1.2.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
    at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:64) ~[spring-security-core-3.1.2.RELEASE.jar:3.1.2.RELEASE]
javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server      
    at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
    at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]  



And so on and on....

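You can use a hashing algorithm from hashlib together with a dictionary like this: {123456789: True}. The value doesn't matter, but for a large file a dict lookup is much faster than searching a list.

Either way, hash each block and, if the hash is not yet in the dictionary, store it there and write the block out. If the hash is already in the dictionary, skip the block. This assumes your duplicate blocks are structured exactly the same.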
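The hashing idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's exact code: the helper names `iter_blocks` and `unique_blocks` are made up here, a block is assumed to be a header line followed by its indented `at ...` lines (as in the log snippet), trailing whitespace is stripped so that otherwise identical blocks hash the same, and a `set` stands in for the dictionary whose values don't matter.

```python
import hashlib


def iter_blocks(lines):
    """Group lines into blocks: one header line plus its 'at ...' frames.

    Assumption: any line that does not start with 'at ' (after leading
    whitespace) begins a new block. Blank lines are skipped.
    """
    block = []
    for line in lines:
        line = line.rstrip()  # ignore trailing spaces and the newline
        if not line:
            continue
        is_frame = line.lstrip().startswith("at ")
        if not is_frame and block:
            yield block
            block = []
        block.append(line)
    if block:
        yield block


def unique_blocks(lines):
    """Yield each distinct block once, in order of first appearance."""
    seen = set()  # hashes of blocks already emitted
    for block in iter_blocks(lines):
        text = "\n".join(block)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


if __name__ == "__main__":
    sample = [
        "A: first error   \n",
        "    at a.b.c(X.java:1)\n",
        "B: second error\n",
        "    at d.e.f(Y.java:2)\n",
        "A: first error\n",
        "    at a.b.c(X.java:1)\n",
    ]
    for text in unique_blocks(sample):
        print(text)
```

Because `unique_blocks` accepts any iterable of lines, an open file object can be passed in directly, so the whole log never has to be held in memory; only the hashes are kept. Two blocks count as duplicates only if every line matches, so a stack trace that is cut off one frame early would still be treated as a different block.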

You can remove the duplicate blocks with the following:

import re
yourstr = r'''
[text hello world yadda

      lines lines lines

      exceptions]

[something i'm not interested in]

[text hello world yadda

      lines lines lines

      exceptions]
'''
pat = re.compile(r'\[([^]]+])(?=.*\[\1)', re.DOTALL)
result = pat.sub('', yourstr)
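Note that only the last occurrence of a block is kept. If you want to keep the first occurrence instead, you have to reverse the string and use this pattern: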

 (][^[]+)\[(?=.*\1\[)

Then reverse the string again.
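A minimal sketch of that reverse, substitute, reverse sequence is shown below. The function name `keep_first_blocks` is hypothetical, the brackets in the pattern are escaped for readability (it is otherwise the same pattern as above), and the `re.DOTALL` flag is assumed to be needed here just as in the first pattern, since blocks span several lines.

```python
import re

# Same pattern as above with the literal brackets escaped:
# a reversed block "]...[" that is followed later by an identical one.
REVERSED_DUP = re.compile(r'(\][^\[]+)\[(?=.*\1\[)', re.DOTALL)


def keep_first_blocks(text):
    """Remove repeated [..] blocks, keeping the first occurrence of each."""
    reversed_text = text[::-1]                  # last block now comes first
    cleaned = REVERSED_DUP.sub('', reversed_text)
    return cleaned[::-1]                        # restore the original order


if __name__ == "__main__":
    sample = "[a]\n[b]\n[a]\n"
    print(repr(keep_first_blocks(sample)))
```

To run this on a file rather than a literal string, the whole file would have to be read into one string first (for example with `f.read()`), because the lookahead needs to see the later blocks. On a very large log the `.*` lookahead may become slow, in which case a hash-based approach is likely the safer choice.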

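@sagarnildass: You can easily adapt this example.

Sorry for misleading you: my file does not actually start with "Block". I only wrote it that way for convenience. I'm really not a Python expert, so I'd be very glad if you could modify my code to show exactly where yours goes. Please edit the code now that you know the blocks don't start with "Block". Thank you very much!

It says there is no method called gsub. Also, what should I use in place of yourstr?

@sagarnildass: Sorry, it is sub and not gsub, my mistake.

@sagarnildass: If you process the file line by line, the code I gave won't work; it can only handle the entire contents of the target file at once. I have edited my post to use your example string to illustrate this. If you want code better suited to your case, please post a real sample of the log (edit your post).

That's the only thing I need. Thank you! Could you modify my code to express your idea? I'm a beginner.

OK, if I understand correctly, you want to get the first line and the line starting with com.hcentive… but not the second javax.xml.ws… line, beca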