Python 2.7: comparing two text files line by line (with masking)
I am working on a project that involves analysing, comparing and validating two long texts, some of them thousands of lines long. The files share common lines and patterns but differ overall. I am interested in the lines that are unique to each file. The following scenario is a good example:
File 1:
- This file is located in 3000.3422.63.34 description "the mother of all files"
- City address of file is "Melbourne"
- Country of file is Australia
File 2:
- This file is located in 3000.3422.62.89 description "the brother of all good files"
- City address of file is "Sydney"
- This file spent sometime in "Gold Coast"
- Country of file is Australia
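The task is to validate file2 using file1 as the reference, by means of a pattern check.
I want to mask the patterns the two files have in common (see below) and then compare them: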
- This is the first file located in 3000.3422.xxxx.xxxx description "xxxx"
- City address of file is "xxxx"
- Country of file is xxxx
With this logic, the second file has one unique line, which I export to a report function:
- This file spent sometime in "Gold Coast"
# write both lists of unique lines to the report file
with open(filename, 'w') as report:
    report.write("-------------------------\n")
    report.write("\nONLY in FILE ONE\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique1))
    report.write("\n-------------------------\n")
    report.write("\nONLY in FILE TWO\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique2))
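How can I easily do dynamic masking (on both files)? Thanks for your help.
Here is the answer; I eventually cracked it myself :)
The comparison function, which works line by line: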
def CompareFiles(str_file1, str_file2):
    '''
    This function compares two long string texts and returns their
    differences as two sequences of unique lines, one list for each.
    '''
    # split each text into lines, delimited by "\n"
    file1_lines = str_file1.split("\n")
    file2_lines = str_file2.split("\n")
    # lines unique to each text are stored in their respective lists
    unique_file1 = []
    unique_file2 = []
    # unique lines in str1 (empty lines are ignored)
    for line1 in file1_lines:
        if line1 != '':
            if line1 not in file2_lines:
                unique_file1.append(line1)
    # unique lines in str2 (empty lines are ignored)
    for line2 in file2_lines:
        if line2 != '':
            if line2 not in file1_lines:
                unique_file2.append(line2)
    return unique_file1, unique_file2
Use this function for masking:
def Masker(pattern_lines, file2mask):
    '''
    This function masks some fields (based on the pattern_lines) with
    dummy text to simplify the comparison
    '''
    # start from the original text so every replacement accumulates
    # (this also guarantees a return value when nothing matches)
    masked_file = file2mask
    # mask the values of all matches from the pattern_lines with dummy data - 'xxxxxxxxxx'
    for pattern in pattern_lines:
        for value in pattern.findall(file2mask):
            # findall yields a str for one group and a tuple for several
            groups = value if isinstance(value, tuple) else (value,)
            for group in groups:
                if group:  # an empty capture would insert the mask everywhere
                    masked_file = masked_file.replace(group, 'x' * 10)
    return masked_file
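Masker replaces every occurrence of a captured value anywhere in the text, so a value such as "Sydney" is also masked on lines that match no pattern. The following is a minimal sketch of an alternative, not part of the original answer, that masks only the captured spans inside each match. The name mask_groups is hypothetical, and the sketch assumes the capture groups are not nested.

```python
import re

MASK = "x" * 10  # same dummy text that Masker uses


def mask_groups(pattern, text, mask=MASK):
    """Mask only the captured groups of each match of `pattern` in `text`.

    Assumes the groups in `pattern` are not nested inside one another.
    """
    def _mask_match(match):
        whole = match.group(0)
        offset = match.start(0)
        pieces = []
        cursor = 0
        for index in range(1, match.re.groups + 1):
            start, end = match.span(index)
            if start == -1:  # this group did not take part in the match
                continue
            pieces.append(whole[cursor:start - offset])  # text before the group
            pieces.append(mask)                          # the group itself, masked
            cursor = end - offset
        pieces.append(whole[cursor:])                    # text after the last group
        return "".join(pieces)

    return pattern.sub(_mask_match, text)


city = re.compile(r'^- City address of file is "(.*)"$', re.M)
sample = '- City address of file is "Sydney"\n- Visited "Sydney" twice'
masked_sample = mask_groups(city, sample)
```

Here only the first line of sample is changed; the second line keeps "Sydney" because it does not match the pattern. To mask with several patterns, the function can be applied once per pattern, feeding each result into the next call.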
Open the files:
# read both files fully; the with-blocks close them automatically
with open("file1.txt", "r") as f1:
    data1 = f1.read()
with open("file2.txt", "r") as f3:
    data3 = f3.read()
Create a folder to store the output file (optional):
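No code is shown for this optional step. One way it might look is sketched below; the helper names ensure_output_dir and report_path, the folder name and the report file name are all assumptions made for illustration, not taken from the original answer.

```python
import os


def ensure_output_dir(path):
    """Create `path` (including parents) if it does not exist yet; return it."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def report_path(folder, name="comparison_report.txt"):
    """Return the path of the report file inside `folder`, creating the folder."""
    return os.path.join(ensure_output_dir(folder), name)


# Example with hypothetical names:
# filename = report_path("comparison_output")
```

The resulting filename can then be used by the report and webbrowser steps further down.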
The pattern lines to mask:
import re

# each capture group marks a field whose value should be masked
pattern_lines = [
    re.compile(r'\- This file is located in 3000\.3422\.(.*) description \"(.*)\"', re.M),
    re.compile(r'\- City address of file is \"(.*)\"', re.M),
    re.compile(r'\- Country of file is (.*)', re.M)
]
Mask both files:
data1_masked = Masker(pattern_lines,data1)
data3_masked = Masker(pattern_lines,data3)
Compare the two files and get the unique lines of each:
unique1, unique2 = CompareFiles(data1_masked, data3_masked)
Report (you can wrap this in a function):
# write both lists of unique lines to the report file
with open(filename, 'w') as report:
    report.write("-------------------------\n")
    report.write("\nONLY in FILE ONE\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique1))
    report.write("\n-------------------------\n")
    report.write("\nONLY in FILE TWO\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique2))
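Since the answer notes that the report can be wrapped in a function, here is a minimal sketch of one possible wrapper. The name write_report is hypothetical, and the blank-line layout is slightly simplified compared with the snippet above.

```python
def write_report(filename, unique1, unique2):
    """Write the lines unique to each file into `filename`, one section per file."""
    rule = "-" * 25
    sections = [
        ("ONLY in FILE ONE", unique1),
        ("ONLY in FILE TWO", unique2),
    ]
    with open(filename, "w") as report:
        for title, lines in sections:
            report.write(rule + "\n")
            report.write(title + "\n")
            report.write(rule + "\n")
            # an empty list simply produces an empty section
            report.write("\n".join(lines) + "\n")
```

It would be called as write_report(filename, unique1, unique2) once CompareFiles has returned the two lists.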
Finally, open the comparison output file:
import webbrowser
webbrowser.open(filename)
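These answers may be useful.
Is there any way to do the masking easily with a regex? @Downshift
As far as I know, this is not a good use case for regular expressions. Expecting regex to make it easy (compared with other techniques) may be unreasonable. I mean, it can probably be done with regex, but a more direct approach may be simpler and more efficient. What is your reason for wanting a regex solution over a traditional line-by-line comparison? Perhaps consider a traditional solution using Python's set() operations.
I have already done the line-by-line comparison, but the output is huge because it flags every difference, even when the lines belong to the same category. If I mask them with the method above, the number of unique lines drops significantly and I do not have to modify my earlier functions.
Do you know the common text in the files before you search? I mean, would you have something like
patterns = ["- This file is located in 3000.3422", "- City address of file is", "- Country of file is Australia"]
?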
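The set() suggestion from the comments can be combined with masking: once both texts are masked, set membership replaces the repeated list scans in CompareFiles. The following is a rough sketch under that assumption; the function name unique_lines and the sample strings are made up for illustration.

```python
def unique_lines(text_a, text_b):
    """Return (only_in_a, only_in_b), keeping each text's original line order."""
    # drop empty lines, as CompareFiles does
    lines_a = [line for line in text_a.split("\n") if line]
    lines_b = [line for line in text_b.split("\n") if line]
    # sets give constant-time membership tests on long files
    set_a, set_b = set(lines_a), set(lines_b)
    only_in_a = [line for line in lines_a if line not in set_b]
    only_in_b = [line for line in lines_b if line not in set_a]
    return only_in_a, only_in_b


masked_a = (
    '- City address of file is "xxxxxxxxxx"\n'
    '- Country of file is xxxxxxxxxx\n'
)
masked_b = (
    '- City address of file is "xxxxxxxxxx"\n'
    '- This file spent sometime in "Gold Coast"\n'
    '- Country of file is xxxxxxxxxx\n'
)
only_a, only_b = unique_lines(masked_a, masked_b)
```

With these sample inputs only_a is empty and only_b holds the "Gold Coast" line. The lists are built from the original line order because a plain set difference would lose that ordering.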