How to speed up this Python code
Does anyone know how to speed up this piece of Python code? It was designed to work with small files (only a few lines, so it runs very fast), but I want to run it on large files (around 50 GB, millions of lines). The main goal of the code is to take strings from a file (.txt), search for those strings in an input file, and print to an output file the number of times those strings occur. Here is the code (infile, seqList and out are set by optparse as options at the beginning of the code, not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
        # Create dictionaries with all the samples
        for i in samples:
            uDic[i.replace(" ", "")] = 0
            rDic[i.replace(" ", "")] = 0
            nmDic[i.replace(" ", "")] = 0
        for k in lines:
            l1 = k.split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1)

If you turn your script into a function (which makes profiling easier) and then look at what it does when you profile it (I suggest using runsnake), I would try replacing your loops with list and dictionary comprehensions:
For example, instead of

for i in samples:
    uDict[i.replace(" ","")] = 0

try a dict comprehension. The same goes for the other dicts.
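The suggested comprehension would collapse that loop into a single expression; a minimal sketch with made-up sample names:

```python
samples = ["sample one", "sample two"]   # made-up sample names

# One dict comprehension replaces the explicit initialisation loop
uDic = {i.replace(" ", ""): 0 for i in samples}
# uDic == {"sampleone": 0, "sampletwo": 0}
```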
I don't fully follow what happens in the "for k in lines" loop, but l3 (and l2) are only needed when l1[4] has certain values. Why not check those values before doing the splits and replaces?
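A sketch of that reordering, using a made-up line and assuming (based on the types used elsewhere in this thread) that only the values 'R', 'U' and 'NM' in the fifth column matter:

```python
lines = [">seq1;extra\ta\tb\tc\tU",   # made-up data in the assumed format
         "tooshort"]

hits = []
for k in lines:
    l1 = k.split("\t")
    # Check the cheap conditions first; split/replace only when needed
    if len(l1) < 5 or l1[4] not in ("R", "U", "NM"):
        continue
    l3 = l1[0].split(";")[0].replace(">", "")
    hits.append(l3)
# hits == ["seq1"]
```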
Finally, instead of iterating over all the keys of a dict to see whether a given element is in that dict, try:
if x in myDict:
    myDict[x] = ....
For example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Apart from that, do try profiling. With a 50 GB file the problem is not so much how to speed it up as how to make it run at all. The main problem is that you will run out of memory, so the code should be modified to process the file without holding all of it in memory, keeping only one line in memory at a time. The following code from your question reads all the lines of both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already
    # Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0
    # similar loop over `lines` comes later on
You should postpone reading the lines until the latest possible moment, for example like this:
# Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you mean to use line.strip() or line.split() here?
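The matching change on the input-file side streams one line at a time instead of building the `lines` list; a runnable sketch (the temp file contents and five-column field layout are made up for illustration):

```python
import os
import tempfile

# Write a small made-up input file so the sketch is runnable
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write(">seq1;a\tx\ty\tz\tU\n")
tmp.close()

uDic = {"seq1": 0}   # assume the dicts were initialised as above
with open(tmp.name, "r") as inf:
    for line in inf:                  # only one line in memory at a time
        l1 = line.strip().split("\t")
        if len(l1) < 5:
            continue
        l3 = l1[0].split(";")[0].replace(">", "")
        if l3 in uDic:
            uDic[l3] += 1
os.unlink(tmp.name)
# uDic == {"seq1": 1}
```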
This way you don't have to keep everything in memory. There are more options to optimise, but this change will get you up and running.

1) Look at a profiler and tune the code that takes the most time.
2) You can try optimising some methods with Cython, using the profiler's data to change the right things.
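For step 1, a minimal profiling sketch with the standard library's cProfile and pstats (the `work` function here is just a stand-in for novo):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for novo(infile, seqList, out)
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()   # shows which calls dominate the runtime
```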
3) It looks like you could replace the dict for the output file with a collections.Counter and the dict for the input file with a set:
from collections import Counter   # note: lowercase "collections"

seen = set()          # renamed from `set = set()`, which would shadow the built-in
counter = Counter()   # essentially a modified dict that is optimized for counting...
                      # like counting occurrences of strings in a text file
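For instance, counting occurrences of strings with a Counter (the tokens are made up):

```python
from collections import Counter

tokens = ["seq1", "seq2", "seq1"]   # made-up strings
counter = Counter()
for t in tokens:
    counter[t] += 1                 # missing keys default to 0

# Counter(tokens) builds the same tally in one call
assert counter == Counter(tokens)
# counter["seq1"] == 2, counter["seq2"] == 1
```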
4) If you're reading 50 GB, you won't be able to keep it all in RAM (I'm assuming; who knows what kind of machine you have), so generators should save you memory and time:
#change list comprehension to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
It would be easier if you provided some sample input. Since you haven't, I haven't tested this, but the idea is simple: iterate over each file only once, using iterators rather than reading the whole file into memory, and use the efficient collections.Counter object to handle the counting and minimise the inner loops:
def novo (infile, seqList, out):
    from collections import Counter
    import csv
    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            if len(l1) < 5:   # l1[4] is used below, so checking len(l1) < 2 would not be enough
                continue
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">","")
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            pct = [c / total for c in countrow] if total else [0, 0, 0]
            f.writerow([sample] + countrow + [total] + pct)
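Since the question gives no sample input, here is a hypothetical smoke test of that Counter-based rewrite, using made-up lines in the assumed ">name;rest", three filler columns, then a type column format (the file names and field layout are assumptions, not from the original question):

```python
import csv
import os
import tempfile
from collections import Counter

def novo(infile, seqList, out):
    # Counter-based rewrite: one pass over each file, one line in RAM at a time
    counts = Counter()
    with open(infile, 'r') as inf:
        for line in inf:
            l1 = line.strip().split("\t")
            if len(l1) < 5:          # l1[4] is read below
                continue
            l3 = l1[0].split(";")[0].replace(">", "")
            counts[(l1[4], l3)] += 1
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            pct = [c / total for c in countrow] if total else [0, 0, 0]
            f.writerow([sample] + countrow + [total] + pct)

# Made-up test files
tmpdir = tempfile.mkdtemp()
inpath = os.path.join(tmpdir, "in.txt")
listpath = os.path.join(tmpdir, "list.txt")
outpath = os.path.join(tmpdir, "out.txt")
with open(inpath, "w") as fh:
    fh.write(">s1;x\ta\tb\tc\tU\n")
    fh.write(">s1;y\ta\tb\tc\tU\n")
    fh.write(">s2;z\ta\tb\tc\tR\n")
with open(listpath, "w") as fh:
    fh.write("s1\ns2\n")

novo(inpath, listpath, outpath)
with open(outpath) as fh:
    rows = [r.split("\t") for r in fh.read().splitlines()]
# rows[1][:5] == ['s1', '0', '2', '0', '2']
```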
For a dataset of size N, how long does it take to run, and what is your target/expected duration? What specific inefficiencies do you see in that code? Without more details this question cannot be answered effectively. This reads more like a comment, since the question asks how to make the code faster rather than how to find the bottleneck.