Tips for optimizing a Python filtering program
I've been writing a very simple program, the gist of which is shown below:
post = open(INPUTFILE1, "rb")
for line in post:
    cut = line.split(',')
    pre = open(INPUTFILE2, "rb")
    for otherline in pre:
        cuttwo = otherline.split(',')
        if cut[1] == cuttwo[1] and cut[3] == cuttwo[3] and cut[9] == cuttwo[9]:
            OUTPUTFILE.write(otherline)
            break
post.close()
pre.close()
OUTPUTFILE.close()
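Essentially, this takes two csv files as input (a "pre" and a "post"). It looks at the first line of the "post" data and tries to find a line in the "pre" data that matches on columns 2, 4 and 10. If there is a match, it writes the "pre" line to a new file.
It works fine, but it takes a very long time. While my "post" data may only have a few hundred lines (perhaps up to 1,000), my "pre" data may have as many as 15 million. As a result, it can take around 10 hours to complete.
I'm still fairly new to Python, so I have a lot to learn about optimization techniques. Does anyone have suggestions for what I could try? Obviously, I know the bottleneck is that I search the entire "pre" data for every match. Is there any way to speed this up?

If you only have a few hundred lines as potential matches, then use something like: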
from operator import itemgetter

key = itemgetter(1, 3, 9)
with open('smallfile') as fin:
    valid = set(key(line.split(',')) for line in fin)

with open('largerfile') as fin:
    lines = (line.split(',') for line in fin)
    for line in lines:
        if key(line) in valid:
            pass  # do something....
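This saves the unnecessary iteration (the large file is read once rather than once per line of the small file) and makes the most of Python's built-ins for efficient lookup.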
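As a fuller sketch of the set-based approach above, the function below reads the small file once, then streams the large file and writes every matching line to an output file. The function name `filter_matching_lines` and its parameters are made up for illustration and are not part of the original answer. It assumes every line has at least ten comma-separated fields and no quoted commas. Note that, unlike the nested loop in the question, it writes every matching line of the large file rather than stopping at the first match for each small-file line.

```python
from operator import itemgetter

# Columns 2, 4 and 10 (zero-based indices 1, 3, 9), as in the question.
key = itemgetter(1, 3, 9)


def filter_matching_lines(small_path, large_path, out_path):
    """Copy lines of large_path whose key columns also occur in small_path.

    Hypothetical helper: the name and signature are illustrative only.
    Returns the number of lines written.
    """
    # One pass over the small file builds a set of key tuples.
    with open(small_path) as fin:
        valid = {key(line.rstrip("\n").split(",")) for line in fin}

    written = 0
    # One pass over the large file; each membership test is O(1) on average.
    with open(large_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if key(line.rstrip("\n").split(",")) in valid:
                fout.write(line)
                written += 1
    return written
```

Stripping the trailing newline before splitting keeps the comparison consistent when column 10 happens to be the last field on the line.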
If you want to use the entire line of the small file in the output when there is a match, then use a dictionary instead of a set:
from operator import itemgetter

key = itemgetter(1, 3, 9)
with open('smallfile') as fin:
    valid = dict((key(line.split(',')), line) for line in fin)
And then your processing loop would be something like:
with open('largerfile') as fin:
    lines = (line.split(',') for line in fin)
    for line in lines:
        otherline = valid.get(key(line), None)
        if otherline is not None:
            pass  # do something....
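To make the dictionary variant concrete, here is a small self-contained sketch that works on in-memory lists of lines rather than files, so it can be run directly. The helper name `match_pairs` and the sample rows are invented for illustration. One caveat worth keeping in mind: if several small-file lines share the same key, the dictionary keeps only the last of them.

```python
from operator import itemgetter

# Zero-based indices of columns 2, 4 and 10.
key = itemgetter(1, 3, 9)


def match_pairs(small_lines, large_lines):
    """Yield (large_line, small_line) pairs whose key columns agree.

    Hypothetical helper; both arguments are iterables of CSV text lines,
    for example open file objects.
    """
    # Map each key tuple to the full small-file line it came from.
    valid = {key(line.rstrip("\n").split(",")): line for line in small_lines}
    for line in large_lines:
        otherline = valid.get(key(line.rstrip("\n").split(",")))
        if otherline is not None:
            yield line, otherline


# Made-up sample data: only the first "large" row matches on all three columns.
small = ["s,1,s,2,s,s,s,s,s,3\n"]
large = ["p,1,p,2,p,p,p,p,p,3\n", "p,9,p,2,p,p,p,p,p,3\n"]
pairs = list(match_pairs(small, large))
```

Because `match_pairs` is a generator, passing open file objects keeps memory use proportional to the small file only.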
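This question is a better fit for a different site than SO.
Don't process the file repeatedly; do it once and cache the results.
@JonClements: I added an example using a dictionary so that the entire line can be retrieved later if needed. Hope that's OK with you.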