How to improve Python iteration performance over large files

I have a reference file of about 9,000 lines with the following structure: (index, size), where the index is unique but the size may not be.

0 193532
1 10508
2 13984
3 14296
4 12572
5 12652
6 13688
7 14256
8 230172
9 16076

I have a data file of about 650,000 lines with the following structure: (cluster, offset, size), where the offset is unique but the size is not.

446 0xdf6ad1 34572
447 0xdf8020 132484
451 0xe1871b 11044
451 0xe1b394 7404
451 0xe1d12b 5892
451 0xe1e99c 5692
452 0xe20092 6224
452 0xe21a4b 5428
452 0xe23029 5104
452 0xe2455e 138136

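I need to compare every size value in the second column of the reference file against the size values in the third column of the data file to determine whether there is a match. If there is a match, I output the offset hex value (second column of the data file) and the index value (first column of the reference file). Currently I am doing this with the following code and redirecting its output to a new file: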

#!/usr/bin/python3

import sys

ref_file = sys.argv[1]
dat_file = sys.argv[2]

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:

    for r_line in ref:
        ref_size = r_line[r_line.find(' ') + 1:-1]

        for d_line in dat:
            dat_size = d_line[d_line.rfind(' ') + 1:-1]
            if dat_size == ref_size:
                print(d_line[d_line.find('0x') : d_line.rfind(' ')]
                      + '\t'
                      + r_line[:r_line.find(' ')])
        dat.seek(0)

Typical output looks like this:

0x86ece1eb  0
0x16ff4628f 0
0x59b358020 0
0x27dfa8cb4 1
0x6f98eb88f 1
0x102cb10d4 2
0x18e2450c8 2
0x1a7aeed12 2
0x6cbb89262 2
0x34c8ad5   3
0x1c25c33e5 3

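This works fine, but for the given file sizes it takes about 50 minutes to complete.

It gets the job done, but as a novice I am always keen to learn ways to improve my coding and to share what I learn. My question is: what changes can I make to improve the performance of this code?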

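You can use a dictionary, dic, keyed by size, so that each size is looked up directly instead of being searched for line by line. (I am assuming here that the sizes do not repeat.)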
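A minimal sketch of that dictionary idea, under the same assumption that no size repeats in the data file. The function name match_unique_sizes and the in-memory line lists are hypothetical and only for illustration; with real files you would pass the open file objects instead.

```python
def match_unique_sizes(ref_lines, dat_lines):
    """Return (offset, index) pairs whose sizes match.

    Assumes every size appears at most once in dat_lines; a repeated
    size would silently overwrite the earlier offset in dic.
    """
    dic = {}
    for line in dat_lines:
        _cluster, offset, size = line.split()
        dic[size] = offset  # size (as a string) -> offset

    matches = []
    for line in ref_lines:
        index, size = line.split()
        if size in dic:  # average O(1) lookup, no inner loop
            matches.append((dic[size], index))
    return matches


if __name__ == "__main__":
    ref = ["0 193532\n", "1 10508\n", "2 11044\n"]
    dat = ["446 0xdf6ad1 34572\n", "451 0xe1871b 11044\n"]
    for offset, index in match_unique_sizes(ref, dat):
        print(offset, index, sep="\t")
```

Because the sizes in the question do repeat in the data file, this form would drop matches there; it only shows the shape of the lookup.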

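Since you look up lines in the files by size, the sizes should be the keys of whatever dictionary data structure you use. You need this dictionary to get rid of the nested loop, which is the real performance killer: with about 9,000 reference lines and about 650,000 data lines, it performs roughly 5.85 billion line comparisons. Furthermore, since your sizes are not unique, you have to store lists of offset or index values (depending on which file you want to keep in the dictionary). A defaultdict will help you avoid some clunky code: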

import sys
from collections import defaultdict

ref_file = sys.argv[1]
dat_file = sys.argv[2]

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    dat_dic = defaultdict(list)  # maintain a list of offsets for each size
    for d_line in dat:
        _, offset, size = d_line.split()
        dat_dic[size].append(offset)

    for r_line in ref:
        index, size = r_line.split()
        for offset in dat_dic[size]:
            # dict lookup is O(1) and not O(N) ...
            # ... as looping over the dat_file is
            print('{offset}\t{index}'.format(offset=offset, index=index))

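If the order of the output lines does not matter, you could consider doing it the other way around: your dat_file is much bigger, so building the defaultdict from it uses a lot more RAM.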
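A sketch of that reversed approach, assuming the output order may follow the data file rather than the reference file. The dictionary is built from the small reference file and the large data file is streamed once. The name match_streaming is hypothetical.

```python
from collections import defaultdict


def match_streaming(ref_lines, dat_lines):
    """Yield (offset, index) pairs, in data-file order."""
    ref_dic = defaultdict(list)  # size -> list of reference indices
    for line in ref_lines:
        index, size = line.split()
        ref_dic[size].append(index)

    for line in dat_lines:
        _cluster, offset, size = line.split()
        # .get() avoids creating an empty list for every unmatched size
        for index in ref_dic.get(size, ()):
            yield offset, index


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as ref, open(sys.argv[2]) as dat:
        for offset, index in match_streaming(ref, dat):
            print(offset, index, sep="\t")
```

Only the roughly 9,000 reference entries are held in memory this way; the data file is read line by line and never stored.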

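I cannot be sure that the sizes are unique in the reference file, but I do know that they occur multiple times in the data file. The values I can be sure are unique are the reference index and the data offset. I will update my question to clarify this. I assume dic only works with unique size values. Is that correct?

If you keep the dat file sorted by size as you insert it, you can use binary search to do each lookup in O(log n) instead of the current O(n). This will speed up the inner loop considerably. You can also keep everything in memory: once the data file is parsed into integers in an efficient array, it will weigh less than 10 MB.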
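A sketch of the binary-search idea using the standard bisect module. For simplicity it sorts once after loading rather than keeping the data sorted on every insertion, and it uses plain lists; the standard array module could hold the same integers more compactly. The names build_sorted and find_offsets are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_sorted(dat_lines):
    """Parse data lines into two parallel lists sorted by size."""
    pairs = []
    for line in dat_lines:
        _cluster, offset, size = line.split()
        pairs.append((int(size), int(offset, 16)))  # int() accepts the 0x prefix
    pairs.sort()
    sizes = [size for size, _ in pairs]
    offsets = [offset for _, offset in pairs]
    return sizes, offsets


def find_offsets(sizes, offsets, size):
    """Return every offset whose size equals `size`, via two binary searches."""
    lo = bisect_left(sizes, size)   # first position with sizes[i] >= size
    hi = bisect_right(sizes, size)  # first position with sizes[i] > size
    return offsets[lo:hi]


if __name__ == "__main__":
    dat = ["451 0xe1871b 11044\n", "451 0xe1b394 7404\n", "452 0xe20092 6224\n"]
    sizes, offsets = build_sorted(dat)
    for offset in find_offsets(sizes, offsets, 7404):
        print(hex(offset))
```

Each lookup costs O(log n) plus the number of matches. A dictionary lookup is usually still faster in Python, so this form is mainly of interest when memory is the concern.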
