Exactly match two strings in Python, except at positions where a specific placeholder string appears
I have a master file containing specific text, for example:
file contains x
the image is of x type
the user is admin
the address is x
Then there are 200 other text files containing text like this:
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
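I need to match these files against the master. If a file contains exactly the same text as the master file, the result should be true. The x is different in every file, i.e. the "x" in the master file can be anything in the other files and the result should still be true. What I have come up with is: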
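arr = master.split('\n')
for file in files:
    a = []  # master lines that do not match this file
    file1 = file.split('\n')
    i = 0
    for line in arr:
        # drop the placeholder word from the master line
        line_list = line.split()
        indx = line_list.index('x')
        line_list1 = line_list[:indx] + line_list[indx+1:]
        st1 = ' '.join(line_list1)
        # drop the word at the same position from the file's line
        file1_list = file1[i].split()
        file1_list1 = file1_list[:indx] + file1_list[indx+1:]
        st2 = ' '.join(file1_list1)
        if st1 != st2:
            a.append(line)
        i += 1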
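This is very inefficient. Is there a way to map the files against the master file and produce the differences found in the other files?

I know this is not a real solution, but you can check whether a file has the same format by doing something like:
if "the image is of" in var:
    pass  # to do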
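and by checking for the remaining lines in the same way:
"file contains"
"the user is"
"the address is"
That way you can verify, to some extent, whether the file you are checking is valid.
You can check this link to learn more about this "substring idea".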
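The substring check described above can be put in a loop, so that one `if` per line is not needed. This is a minimal sketch, assuming the fixed fragments are listed by hand and appear in the same order as the lines; `REQUIRED_FRAGMENTS` and `has_expected_format` are hypothetical names, not from the original post.

```python
# Hypothetical list of the fixed text expected on each line, in order.
REQUIRED_FRAGMENTS = [
    "file contains",
    "the image is of",
    "the user is",
    "the address is",
]

def has_expected_format(text, fragments=REQUIRED_FRAGMENTS):
    """Return True if line i of text contains fragments[i] for every i."""
    lines = text.splitlines()
    if len(lines) != len(fragments):
        return False
    return all(frag in line for frag, line in zip(fragments, lines))
```

This only confirms that each fragment occurs somewhere in its line, so it is a loose format check rather than an exact comparison.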
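Is the generic placeholder unique within the line? For example, if the key really is x, is it guaranteed that x does not appear anywhere else in the line? Or might the master file contain something like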
excluding x records and x axis values
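If you have a unique key, then for each line, split the master line on the key x. This gives you two pieces, front and back. Then simply check whether the candidate line starts with front and ends with back. Something like: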
for line in arr:
    # split the master line around the unique key
    front, back = line.split(x_key)
    # grab next line in input file
    ...
    if line_list1.startswith(front) and line_list1.endswith(back):
        pass  # process matching line
    else:
        pass  # process non-matching line
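The front/back idea can be sketched as a runnable function. This version is an assumption-laden variant rather than the answer's exact code: it splits on whole words instead of characters (so an "x" inside a word such as "next" is not mistaken for the key), compares lines by position, and requires an exact match on lines that contain no key. `line_matches` and `mismatched_lines` are hypothetical names.

```python
def line_matches(master_line, candidate, x_key="x"):
    """Compare candidate to master_line, ignoring whatever replaces x_key."""
    words = master_line.split()
    cand = candidate.split()
    if x_key not in words:
        # no placeholder on this line: require the same words
        return words == cand
    idx = words.index(x_key)  # first (assumed only) placeholder
    front, back = words[:idx], words[idx + 1:]
    if len(cand) < len(front) + len(back) + 1:
        return False  # nothing left over to stand in for x
    return cand[:len(front)] == front and cand[len(cand) - len(back):] == back

def mismatched_lines(master_text, other_text, x_key="x"):
    """Return the master lines whose same-position counterpart does not match."""
    other_lines = other_text.splitlines()
    bad = []
    for i, master_line in enumerate(master_text.splitlines()):
        other = other_lines[i] if i < len(other_lines) else ""
        if not line_matches(master_line, other, x_key):
            bad.append(master_line)
    return bad
```

An empty list from `mismatched_lines` means the file agrees with the master everywhere except at the placeholder.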
UPDATE per OP comment
As long as x is unique in the line, you can easily adapt this. As you mentioned in your comment, you want something like:
if len(line) == len(line_list1):
    # compare position by position; range() is needed to iterate over indices
    if all(line[i] == line_list1[i] for i in range(len(line))):
        pass  # Found matching lines
    else:
        pass  # Advance to the next line
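When x is not unique in a line, as in "excluding x records and x axis values", one option is to turn each master line into a regular expression. This is a sketch of a swapped-in technique, not the approach from the answer above; `master_line_to_regex` and `regex_line_matches` are hypothetical names, and it assumes the key only counts when it stands alone as a word.

```python
import re

def master_line_to_regex(master_line, x_key="x"):
    """Compile a pattern in which each standalone x_key matches any non-empty text."""
    parts = []
    for word in master_line.split():
        # the placeholder becomes a wildcard; every other word is matched literally
        parts.append(r".+?" if word == x_key else re.escape(word))
    return re.compile(r"\s+".join(parts))

def regex_line_matches(master_line, candidate, x_key="x"):
    """Return True if candidate equals master_line apart from the placeholders."""
    pattern = master_line_to_regex(master_line, x_key)
    return pattern.fullmatch(candidate.strip()) is not None
```

Using `fullmatch` rather than `search` keeps the comparison exact outside the placeholders.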
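I think this approach does what you are asking. It also lets you specify whether only the same difference is allowed on every line (which would treat your second file example as a non-match):
Update: this now accounts for the lines in the master file and the other files not necessarily being in the same order.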
from itertools import zip_longest

def get_min_diff(master_lines, to_check):
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        # master words that differ from to_check at the same position
        diff = [w for w, m in zip_longest(ml, to_check) if w != m]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            best_diff = diff  # keep the diff that belongs to the best match
            match_line = ln
    return min_diff, best_diff, match_line

def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())
    matches = []
    for f in files:
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find each place in current line where it differs from
                # the corresponding line in the master file
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                    # if it only differs in one place, keep track of what
                    # word was different for optional check later
                    if min_diff == 1:
                        diff_types.add(diff[0])
                diff_sizes.add(min_diff)
        # if you want any file where the max number of differences
        # per line was 1 (an empty file never counts as a match)
        if max(diff_sizes, default=0) == 1:
            # consider a match if there is only one difference per line
            matches.append(f)
        # if you instead want each file to only
        # be different by the same word on each line
        #if len(diff_types) == 1:
            #matches.append(f)
    return matches
When run, the code above returns the correct files:
In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']
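Thanks, Bruno. Yes, I can validate the other files with the approach above, but even then I would need to write an if condition for every line, and my master file contains 100 lines. I feel it would be better if I could ignore the "x" variable, map the two files, and produce the differences in the other files, like the Notepad++ compare feature but ignoring "x".
I see. I have never used regular expressions; aren't they meant for cases like this?
I have never used regular expressions much either, but I will look into it. Thanks.
Thanks a lot, Prune. Some blocks have multiple "x"s, so those need a different approach. Also, I think the if condition would be better as: if len(line) == len(line_list1) and line_list1.startswith(front) and line_list1.endswith(back): since there may be more variables in the middle.
Thanks for this approach, Glarue. One thing: line 1 of the master file does not have to be line 1 of the file being checked. Line 1 of the master might be line 4 of the check file (but we can be sure that lines 2 and 3 of the check file are not in the master). My idea is that if line 1 of the master is not line 1 of the check file, we check the other lines of the check file, and if line 1 is not there, we add it to the report. Then, for line 2 of the master, we start from the point in the check file where the match for line 1 stopped. If you have a better approach, please let me know.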
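The in-order scan described in the last comment could look roughly like this. It is a sketch under assumptions, not code from the thread: each standalone x in a master line stands for exactly one word, master lines appear in the check file in the same relative order, and a master line that is not found leaves the scan position where it was. `same_except_x` and `missing_master_lines` are hypothetical names.

```python
def same_except_x(master_line, candidate, x_key="x"):
    """Word-by-word comparison in which x_key matches any single word."""
    mw, cw = master_line.split(), candidate.split()
    return len(mw) == len(cw) and all(m == x_key or m == c for m, c in zip(mw, cw))

def missing_master_lines(master_lines, check_lines, x_key="x"):
    """Return the master lines that cannot be found, in order, in check_lines."""
    missing = []
    pos = 0  # index in check_lines where the search for the next master line starts
    for master_line in master_lines:
        for j in range(pos, len(check_lines)):
            if same_except_x(master_line, check_lines[j], x_key):
                pos = j + 1  # resume after the matched line
                break
        else:
            # no match from pos onward: report it and keep pos unchanged
            missing.append(master_line)
    return missing
```

Extra lines in the check file are skipped silently here; only master lines without a counterpart end up in the report.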