Loading a large text file with Python


I need to process a large text file (4 GB). Its data looks like this:

12 23 34
22 78 98
76 56 77

I need to read each line and do some work based on its values. This is what I currently do:

sample = 'filename.txt'

with open(sample) as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
        do_someprocess()

This takes a lot of time to run. Is there a better way to do this in Python?

split() returns a list. You are then accessing its first, second, and third elements like this:

line = [int(i) for i in line]
a = line[0]
b = line[1]
c = line[2]

Instead, you can write a, b, c = line.split() directly. Then a will contain line[0], b will contain line[1], and c will contain line[2]. This will save you some time.

with open(sample) as f:
    for line in f:
        a, b, c = line.split()
        do_someprocess()

For example:

with open("sample.txt","r") as f:
    for line in f:
        a,b,c = line.split()
        print(a, b, c)

The sample.txt file:

12 34 45
78 67 45

Output:

12 34 45
78 67 45
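Note that the unpacking above leaves a, b, and c as strings. A minimal sketch of the same idea that also converts the fields to int, assuming every line holds exactly three whitespace-separated integers (parse_line is a hypothetical helper name, not from the original post):

```python
def parse_line(line):
    """Split a whitespace-separated line and convert its three fields to int."""
    # map(int, ...) converts lazily; unpacking consumes exactly three values.
    a, b, c = map(int, line.split())
    return a, b, c


sample_lines = ["12 34 45\n", "78 67 45\n"]
for line in sample_lines:
    a, b, c = parse_line(line)
    print(a + b + c)  # prints 91, then 190
```

Unpacking raises ValueError if a line has more or fewer than three fields, which may or may not be what you want for malformed input.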

Edit: I wanted to elaborate on this. I used the timeit module to compare how long each version takes to run. Please let me know if I am doing something wrong here. Below is the code as written in the question:

import timeit

v = """
with open("sample.txt", "r") as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
"""
print(timeit.timeit(stmt=v, number=100000))

Output:

8.94879606286   ## seconds to complete 100000 times.

Below is my way of writing the code:

import timeit

s = """
with open("sample.txt", "r") as f:
    for line in f:
        a, b, c = [int(s) for s in line.split()]
"""
print(timeit.timeit(stmt=s, number=100000))

Output:

7.60287380216 ## seconds to complete same number of times.

If do_someprocess() takes a long time compared to reading the lines, and you have spare CPU cores, you can use the multiprocessing module.
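A rough sketch of what that could look like, assuming the per-line work is independent of other lines. Here process_line is a hypothetical stand-in for do_someprocess() that just sums the three values, and process_file, workers, and chunksize are illustrative names:

```python
import multiprocessing


def process_line(line):
    """Hypothetical stand-in for do_someprocess(): parse one line and sum it."""
    a, b, c = map(int, line.split())
    return a + b + c


def process_file(path, workers=None, chunksize=10000):
    """Yield one result per line, computed by a pool of worker processes."""
    with open(path) as f, multiprocessing.Pool(workers) as pool:
        # imap hands lines to the workers in chunks and yields results in
        # input order, so the file is never read into a single list.
        for result in pool.imap(process_line, f, chunksize=chunksize):
            yield result
```

It would be called under an if __name__ == "__main__": guard, for example total = sum(process_file("filename.txt")). A large chunksize matters here: with very cheap per-line work, sending lines to workers one at a time can cost more than the work itself, so it is worth measuring before committing to this approach.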

If possible, try using PyPy. For some compute-intensive tasks it is tens of times faster than CPython.

If the file contains many repeated ints, using a dict mapping can be faster than int(), because it saves the time of creating new int objects.
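One way to sketch the dict idea is a small cache keyed by the text token. The names make_cached_int and to_int are hypothetical, and whether this beats plain int() depends on how often values repeat, so it should be timed on real data:

```python
def make_cached_int():
    """Return a converter that remembers every token it has already parsed."""
    cache = {}

    def to_int(token):
        try:
            return cache[token]  # repeated token: reuse the existing int
        except KeyError:
            value = cache[token] = int(token)  # first sighting: parse and store
            return value

    to_int.cache = cache  # exposed only so the example can inspect it
    return to_int


to_int = make_cached_int()
lines = ["12 23 34\n", "22 78 98\n", "12 23 34\n"]
rows = [[to_int(tok) for tok in line.split()] for line in lines]
print(rows[2])            # [12, 23, 34]
print(len(to_int.cache))  # 6 distinct tokens were parsed
```

The cache grows with the number of distinct tokens, so memory use is a consideration if the file has few repeats.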

The first step is to profile, as @nathancahill suggested in the comments. Then focus your effort on the parts where the biggest gains can be had.
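As an illustration of profiling from inside a script, here is a minimal sketch using the standard cProfile and pstats modules. The data is synthetic and in memory, parse_lines is a hypothetical name, and the sum stands in for do_someprocess():

```python
import cProfile
import io
import pstats


def parse_lines(lines):
    """Parse whitespace-separated integer triples, as in the question."""
    total = 0
    for line in lines:
        a, b, c = [int(tok) for tok in line.split()]
        total += a + b + c  # hypothetical stand-in for do_someprocess()
    return total


# Synthetic in-memory data so the sketch runs without a real 4 GB file.
lines = ["12 23 34\n", "22 78 98\n", "76 56 77\n"] * 10000

profiler = cProfile.Profile()
profiler.enable()
total = parse_lines(lines)
profiler.disable()

# Collect the report in a string and show the entries with the highest
# cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report shows where the time goes between splitting, converting, and the per-line work, which indicates which part is worth optimizing.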

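What does someprocess() do? Are you sure that split() and int() are the functions taking the most time? You can run python -m cProfile myscript.py to make sure you are optimizing the right function.

Note that this will fail if line contains more than 3 elements. It would be better to write a, b, c = line.split()[:3]

The OP mentioned only three data values in his/her example.

Your code skips converting the values to int.

@gnibbler You are right. I have now edited my code.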
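To illustrate the slicing suggestion from the comments, here is a small sketch (first_three is a hypothetical helper name) that keeps only the first three fields and converts them to int:

```python
def first_three(line):
    """Return the first three whitespace-separated fields of a line as ints."""
    a, b, c = [int(tok) for tok in line.split()[:3]]
    return a, b, c


print(first_three("12 23 34 99\n"))  # the extra fourth field is ignored

try:
    a, b, c = "12 23 34 99".split()  # plain unpacking rejects a fourth field
except ValueError as exc:
    print("unpacking failed:", exc)
```

Slicing only guards against extra fields; a line with fewer than three fields still raises ValueError.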