Python: finding the differences between two files is very slow
I'm definitely a Python beginner, but I'm trying to compare some data that was extracted from two databases into files. In the script I use one dictionary for each database's contents, and when a difference is found I add it to the dictionary. The key is the combination of the first two values (code and sub-code), and the value is the list of long codes associated with that code/sub-code combination. Overall my script works, but I wouldn't be surprised if it is badly structured and inefficient. The sample data being processed looks like this:
0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
The idea is that the data should be in sync, but sometimes it isn't, and I'm trying to detect the differences. In practice, when I extract the data from the DBs, each file has over a million lines. Performance isn't great (maybe that's the best it can do?): it takes about 35 minutes to process and produce a result. Any suggestions for improving performance would be gladly received.
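Each row splits into a code, a sub-code, and a long code; the first two form the dictionary key. A minimal parsing sketch of that split (function and variable names here are illustrative, not from the original script):

```python
def parse_row(line):
    # Split at most twice, so a long code containing commas stays intact.
    code, sub_code, long_code = line.rstrip("\n").split(",", 2)
    # The key is "code,subCode"; the value is the long code.
    return (code + "," + sub_code, long_code)

key, long_code = parse_row("2,2,214")
# key == "2,2", long_code == "214"
```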
import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()

with open('masterDbCodes.lst', 'r') as f1, open('slaveDbCodes.lst', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [longCode]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [longCode]
            else:
                slaveDb[codeSubCode].append(longCode)
f1.close()
f2.close()
Try this:
import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()

with open('masterDbCodes.lst', 'r') as f1, open('slaveDbCodes.lst', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            parts = line.split(",", 2)   # split once and reuse
            codeSubCode = ",".join(parts[:2])
            longCode = ",".join(parts[2:]).rstrip()
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            line = line[2:]
            parts = line.split(",", 2)
            codeSubCode = ",".join(parts[:2])
            longCode = ",".join(parts[2:]).rstrip()
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]
f1.close()
f2.close()
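The change in this answer replaces the "if key not in dict" test with an EAFP try/except around the append. For comparison, dict.setdefault from the standard library achieves the same grow-a-list-per-key pattern in one line (a sketch, not code from the thread):

```python
masterDb = {}
for codeSubCode, longCode in [("0,0", "83"), ("0,0", "84")]:
    # Create the empty list on first sight of the key, then append.
    masterDb.setdefault(codeSubCode, []).append(longCode)
# masterDb == {"0,0": ["83", "84"]}
```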
So I ended up designing a more efficient script using different logic. Thanks a lot for your help:
#!/usr/bin/python
import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

# First find entries in SLAVE that don't match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)
    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [longCode]
            else:
                slaveDb[codeSubCode].append(longCode)

# Now find entries in MASTER that don't match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)
    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [longCode]
            else:
                masterDb[codeSubCode].append(longCode)
This solution processes the data (twice, in fact) in about 10 seconds.

You might want to point out the changes precisely, perhaps explaining in words what was changed and why.

I did try your code changes and processing took 33 minutes, so there was some improvement in processing time. Thanks for your input.

I don't know whether this would be faster, but the ordereddefaultdict class I defined when starting another question would let you, in both cases, drop the four lines beginning with if not codeSubCode in xxxDb and replace them with an unconditional xxxDb[codeSubCode].append(longCode). Note also that you don't need to close the two files; with closes them automatically.
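The suggestion above can be sketched with the standard library's collections.defaultdict(list) (the ordereddefaultdict class from the other question isn't shown here; on Python 3.7+ plain dicts already preserve insertion order, so ordering is kept too). This is my illustration of the unconditional-append pattern, not code from the thread:

```python
import collections

# defaultdict(list) creates the empty list on first access of a key,
# so the four-line "if key not in db" block collapses to one append.
masterDb = collections.defaultdict(list)

for codeSubCode, longCode in [("0,0", "83"), ("2,2", "213"), ("2,2", "214")]:
    masterDb[codeSubCode].append(longCode)  # unconditional append

# masterDb == {"0,0": ["83"], "2,2": ["213", "214"]}
```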