
Python: finding the differences between two files is very slow


I'm very much a Python beginner, but I'm trying to compare some data extracted from two databases into files. In the script I use one dictionary per database's contents, and when a difference is found I add it to the corresponding dictionary. The key is the combination of the first two values (code and subCode), and the value is a list of long codes associated with that code/subCode combination. Overall the script works, but I wouldn't be surprised if it's badly structured and inefficient. The sample data being processed looks like this:

0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
The idea is that the data should be in sync, but sometimes it isn't, and I'm trying to detect the differences. In practice, when I extract the data from the DBs, each file has over a million lines. The performance isn't great (maybe that's the best it can do?): it takes about 35 minutes to process and produce a result. I'd gladly take any suggestions for improving performance.

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

f1.close()
f2.close()
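As an aside, the repeated `if not codeSubCode in …` / `else` insert pattern in the code above can be collapsed with `dict.setdefault` (a small illustrative sketch; the key and value strings here are made up):

```python
masterDb = {}

# setdefault returns the existing list for the key, or inserts and
# returns a fresh empty list, so the if/else branch disappears.
masterDb.setdefault("0,1", []).append("157")
masterDb.setdefault("0,1", []).append("158")

# masterDb is now {"0,1": ["157", "158"]}
```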
Try this:

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            sp = line.split(",", 2)          # split once and reuse the parts
            codeSubCode = ",".join(sp[:2])
            longCode = sp[2].rstrip() if len(sp) > 2 else ''
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            line = line[2:]
            sp = line.split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = sp[2].rstrip() if len(sp) > 2 else ''
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]

In the end I used different logic and came up with a much more efficient script. Thanks a lot for your help.

#!/usr/bin/python

import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

#First find entries in SLAVE that dont match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)

    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

#Now find entries in MASTER that dont match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)

    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)

This solution processes the data (twice, actually) in about 10 seconds.
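The speedup comes from replacing difflib's sequence matching with O(1) set membership tests. The core idea can be sketched with in-memory data in the post's `code,subCode,longCode` format (the sample strings here are made up, not taken from the real databases):

```python
import csv, io

# Hypothetical sample data in the same three-column format as the post.
master_text = "0,0,83\n0,1,157\n1,1,158\n"
slave_text = "0,0,83\n0,1,999\n1,1,158\n"

master_rows = {tuple(r) for r in csv.reader(io.StringIO(master_text))}
slave_rows = {tuple(r) for r in csv.reader(io.StringIO(slave_text))}

# Rows present on one side only: a single set difference per direction,
# instead of difflib's pairwise line matching.
only_in_master = master_rows - slave_rows   # {('0', '1', '157')}
only_in_slave = slave_rows - master_rows    # {('0', '1', '999')}
```

With sets, each lookup is constant time regardless of file size, which is why two full passes over million-line files still finish in seconds.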

You may want to point out your changes precisely, perhaps explaining in words what you changed and why. I did try your code changes and processing took 33 minutes, so there is some improvement in processing time. Thanks for your input.

I don't know whether this will be faster, but the `ordereddefaultdict` class I defined when starting another question would let you drop, in both cases, the four lines beginning with `if not codeSubCode in xxxDb:` and replace them with an unconditional `xxxDb[codeSubCode].append(longCode)`. Note that you also don't need to close the two files; `with` closes them automatically.
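The `ordereddefaultdict` class mentioned above isn't shown in this thread; one possible implementation (a sketch, not necessarily the commenter's exact class) combines `OrderedDict` with the `__missing__` hook:

```python
import collections

class OrderedDefaultDict(collections.OrderedDict):
    """OrderedDict that inserts a fresh list for any missing key."""
    def __missing__(self, key):
        self[key] = value = []
        return value

db = OrderedDefaultDict()
db["0,1"].append("157")   # no membership test needed
db["0,1"].append("158")

# db is now OrderedDefaultDict([("0,1", ["157", "158"])])
```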