Optimizing string search in Python
I have to write a Python program that, given a large 50 MB DNA sequence and a smaller DNA sequence of about 15 characters, returns a list of all 15-character sequences, ordered by their distance from the given sequence and by their position in the larger sequence.

My current approach is to first get all the subsequences:
def get_subsequences_of_size(size, data):
    sequences = {}
    i = 0
    while i + size <= len(data):
        sequence = data[i:i+size]
        if sequence not in sequences:
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences
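My problem is that this approach is too slow. For a 50 MB input, it takes more than 30 minutes to finish.

How about the following approach? Slide a window of length 15 over the long sequence and, for each subsequence:
- store its start position in the long sequence
- compute and store its similarity to the short sequence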
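I also ran the script on a 50 MB FASTA file. On my machine, computing the results took 42 seconds, and writing them to a file took another 30 seconds (printing them would take even longer!).

Can you explain what your question is?

Dude, I edited this so many times that I must have forgotten. Sorry, I have edited the post to add it: the problem is that it is too slow for me. A 50 MB input takes 30 minutes to run.

Maybe try rephrasing your scientific question: instead of looking for all 15-mers, try looking for perfect hits, one mutation, two mutations, and so on. Or take a look at this library:

Try this: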
from itertools import islice
from collections import defaultdict
short_seq = 'TGGCGACGGACTTCA'
long_seq = 'AGAACGTTTCGCGTCAGCCCGGAAGTGGTCAGTCGCCTGAGTCCGAACAAAAATGACAACAACGTTTATGACAGAACATT' +\
'CCTTGCTGGCAACTACCTGAAAATCGGCTGGCCGTCAGTCAATATCATGTCCTCATCAGATTATAAATGCGTGGCGCTGA' +\
'CGGATTATGACCGTTTTCCGGAAGATATTGATGGCGAGGGGGATGCCTTCTCTCTTGCCTCAAAACGTACCACCACATTT' +\
'ATGTCCAGTGGTATGACGCTGGTGGAGAGTTCCCCCGGCAGGGATGTGAAGGATGTGAAATGGCGACGGACTTCACCGCA' +\
'TGAGGCTCCACCAACCACGGGGATACTGTCGCTCTATAACCGTGGCGATCGCCGTCGCTGGTACTGGCCCTGTCCACACT' +\
'GTGGTGAGTATTTTCAGCCCTGCGGCGATGTGGTTGCTGGTTTCCGTGATATTGCCGATCCCGTGCTGGCAAGTGAGGCG' +\
'GCTTATATTCAGTGTCCTTCTGGCGACGGACTTCACGCGTCAGCCCGGAAGTGGTCAGTCGCCTGAGTCCGAACAAAAAT'

def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable.

    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
    """
    # from https://docs.python.org/release/2.3.5/lib/itertools-example.html
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield ''.join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield ''.join(result)

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

k = len(short_seq)
locations = defaultdict(list)     # subsequence -> list of start positions
similarities = defaultdict(set)   # distance -> set of subsequences at that distance

for start, subseq in enumerate(window(long_seq, k)):
    locations[subseq].append(start)
    similarity = hamming_distance(subseq, short_seq)  # substitute with your own similarity function
    similarities[similarity].add(subseq)

# Write the subsequences grouped by distance, smallest distance first
with open(r'stack46268997.txt', 'w') as f:
    for similarity in sorted(similarities.keys()):
        f.write("Sequence(s) which differ in {} base(s) from the short sequence:\n".format(similarity))
        for subseq in similarities[similarity]:
            f.write("{} at location(s) {}\n".format(subseq, ', '.join(map(str, locations[subseq]))))
        f.write('\n')

Sequence(s) which differ in 0 base(s) from the short sequence:
TGGCGACGGACTTCA at location(s) 300, 500
Sequence(s) which differ in 5 base(s) from the short sequence:
TGGCGATCGCCGTCG at location(s) 362
Sequence(s) which differ in 6 base(s) from the short sequence:
TGGCAACTACCTGAA at location(s) 86
TGGTGAGTATTTTCA at location(s) 401
TGGCGAGGGGGATGC at location(s) 191
Sequence(s) which differ in 7 base(s) from the short sequence:
ATGTGAAGGATGTGA at location(s) 283
AGGGGGATGCCTTCT at location(s) 196
TGACAACAACGTTTA at location(s) 53
CGCTGACGGATTATG at location(s) 154
TTATGACCGTTTTCC at location(s) 164
TGGTTGCTGGTTTCC at location(s) 430
TCGCGTCAGCCCGGA at location(s) 8
AGTCGCCTGAGTCCG at location(s) 30, 536
CGGCGATGTGGTTGC at location(s) 422
[... and so on...]
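
The suggestion in the comments, to look only for perfect hits, one mutation, two mutations and so on instead of collecting every 15-mer, could look roughly like the sketch below. The function name `find_near_matches` and the `max_mismatches` parameter are made up for illustration and are not part of the answer above. It assumes plain Hamming distance and abandons a window as soon as it exceeds the allowed number of mismatches, which should cut both the work per window and the size of the output.

```python
from collections import defaultdict


def find_near_matches(long_seq, short_seq, max_mismatches=2):
    """Return {distance: {subsequence: [start positions]}} for every window
    of long_seq that differs from short_seq in at most max_mismatches bases.

    Hypothetical helper for illustration; not part of the original answer.
    """
    k = len(short_seq)
    hits = defaultdict(lambda: defaultdict(list))
    for start in range(len(long_seq) - k + 1):
        subseq = long_seq[start:start + k]
        distance = 0
        for a, b in zip(subseq, short_seq):
            if a != b:
                distance += 1
                if distance > max_mismatches:
                    break  # too many mismatches: skip this window early
        else:
            # Loop finished without break, so the window is close enough
            hits[distance][subseq].append(start)
    # Convert to plain dicts so missing keys are not created on lookup
    return {d: dict(by_seq) for d, by_seq in hits.items()}


hits = find_near_matches("ACGTACGTTACGA", "ACGT", max_mismatches=1)
for distance in sorted(hits):
    # Within one distance, order subsequences by their first position
    for subseq, positions in sorted(hits[distance].items(), key=lambda item: item[1][0]):
        print(distance, subseq, positions)
```

Positions are appended in scan order, so each list is already sorted. Sorting by distance and then by first position gives the ordering the question asks for.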