Python, huge iteration performance problem (Python, Iteration, Bioinformatics)

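I'm iterating over 3 words, each about 5 million characters long, and I want to find the sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. Even running my program overnight, I have never completed a single word.

The function below takes a list of dictionaries, where each dictionary contains every possible 20-character word and its location in one of the 5-million-character words.

If anybody has an idea how to optimize this I would be really thankful; I don't have a clue how to continue.

Here's a sample of my code: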

def findUnique(list):
    # Takes a list of dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList

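The slices function (defined in the test script further down) is O(n), n being the length of each word, and you would use set(slices(..)) with set operations such as difference to get the slices that are unique across all words (example below). If you don't want to track locations, you could also write the function to return a set. Memory usage will be high (still O(n), just with a large constant factor). It could possibly be mitigated, though not by much if the length is only 20, with a special representation that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices: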
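import operator
from functools import reduce  # reduce is not a builtin in Python 3

a = set(slices("aab", 2))  # {"aa", "ab"}
b = set(slices("abb", 2))  # {"ab", "bb"}
c = set(slices("abc", 2))  # {"ab", "bc"}
all_sets = [a, b, c]  # renamed from all to avoid shadowing the builtin
# Subtract every other set from a, leaving the slices found only in a
a_unique = reduce(operator.sub, (x for x in all_sets if x is not a), a)
print(a_unique)  # {"aa"}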

Including locations:
import operator
from functools import reduce  # reduce is not a builtin in Python 3

a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all_slices = [a, b, c]  # renamed from all to avoid shadowing the builtin
a_unique = reduce(operator.sub, (set(x) for x in all_slices if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print(a_unique)  # {"aa": 0} or {"aa": [0]}
                 # (depending on which slices function used)
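
The comment above mentions a list-valued result, which implies a slices variant that records every start position rather than a single one. A minimal sketch of what that could look like in Python 3; the name slices_all_positions is made up here for illustration and is not part of the original answer:

```python
from collections import defaultdict

def slices_all_positions(seq, length):
    """Map each slice of the given length to a list of all its start positions."""
    positions = defaultdict(list)
    for start in range(len(seq) - length + 1):
        positions[seq[start:start + length]].append(start)
    return dict(positions)

print(slices_all_positions("aabaa", 2))  # {'aa': [0, 3], 'ab': [1], 'ba': [2]}
```

Keeping every position costs extra memory per repeated slice, so the single-position version is probably the better default when only uniqueness matters.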


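In a test script closer to your conditions, using randomly generated words of 5 million characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python was spending very little time on the CPU, and I killed it. After reducing either the slice length or the word length to fit within main memory, it ran in under a minute. (I used completely random words, which reduces duplicates and increases memory use.) This situation plus the O(n**2) in your original code would take forever, which is why algorithmic time and space complexity are both important.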
import operator
import random
import string
from functools import reduce  # reduce is not a builtin in Python 3

def slices(seq, length):
  # Map each slice to a start position. Iterating backwards means earlier
  # positions overwrite later ones, so the first occurrence is kept.
  unique = {}
  for start in range(len(seq) - length, -1, -1):
    unique[seq[start:start+length]] = start
  return unique

def sample_with_repeat(population, length, choice=random.choice):
  return "".join(choice(population) for _ in range(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.ascii_lowercase, word_length) for _ in range(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
# For each word's set, subtract the sets of all the other words
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print([len(x) for x in unique_words_slices])
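
Since memory is the bottleneck described above, one possible variation (not from the original answer) is to avoid building a separate set per word and instead keep a single dictionary that records which word owns each slice. The sketch below is written for Python 3; the function name unique_slices_by_word and the SHARED marker are hypothetical.

```python
def unique_slices_by_word(words, length):
    """Return one dict per word mapping its unique slices to a start position."""
    SHARED = -1  # marker for slices seen in more than one word
    owner = {}   # slice -> (word index, first start position) or SHARED
    for idx, word in enumerate(words):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            seen = owner.get(piece)
            if seen is None:
                owner[piece] = (idx, start)
            elif seen != SHARED and seen[0] != idx:
                # the slice already belongs to a different word
                owner[piece] = SHARED
    result = [{} for _ in words]
    for piece, seen in owner.items():
        if seen != SHARED:
            result[seen[0]][piece] = seen[1]
    return result

print(unique_slices_by_word(["aab", "abb", "abc"], 2))
# [{'aa': 0}, {'bb': 1}, {'bc': 1}]
```

This still stores every distinct slice once, so it does not remove the memory problem for fully random 5-million-character words; it only avoids holding duplicate copies of the per-word sets.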

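You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be usable. In the worst case, however, it would run out of memory constructing all 256^20 keys. Knowing about your inputs makes all the difference.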

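Edit
I had a look at some genome data to see how this idea stacked up, using a hard-coded [acgt]->[0123] mapping and 4 children per trie node.
Adenovirus 2: 35937bp -> 35899 distinct 20-base sequences using 469339 trie nodes
Enterobacteria phage lambda: 48502bp -> 40921 distinct 20-base sequences using 529384 trie nodes
I didn't get any collisions, either within or between the two data sets, although there may be more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example, you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this, though. You need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.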
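
The trie experiment described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the code that produced the numbers quoted; it assumes lowercase input containing only a, c, g and t, and the names BASE_INDEX, new_node and count_trie_nodes are invented here.

```python
BASE_INDEX = {"a": 0, "c": 1, "g": 2, "t": 3}  # hard-coded [acgt] -> [0123]

def new_node():
    # each node has four child slots, one per base
    return [None, None, None, None]

def count_trie_nodes(sequence, length):
    """Insert every slice of the given length; return (distinct slices, nodes)."""
    root = new_node()
    nodes = 1      # count the root node
    distinct = 0
    for start in range(len(sequence) - length + 1):
        node = root
        created = False
        for ch in sequence[start:start + length]:
            i = BASE_INDEX[ch]
            if node[i] is None:
                node[i] = new_node()
                nodes += 1
                created = True
            node = node[i]
        # all slices have the same length, so a slice is new exactly
        # when inserting it created at least one node
        if created:
            distinct += 1
    return distinct, nodes

print(count_trie_nodes("acgtacgtac", 4))  # (4, 17)
```

Whether the node counts quoted above include the root is not stated, so the totals from this sketch may differ by one from that convention.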
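
Let me elaborate further. If memory is an issue, I'd suggest that instead of using the strings as the keys of the dictionary, you use a hashed value of each string. This would save the cost of storing an extra copy of the strings as keys (at worst, 20 times the storage of an individual "word").
(If you really want to get fancy, there is a further refinement you could use, though you would need to change the function.)
Now, we can combine all the hashes: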
import collections

unique = []  # Unique words in first string

# create a dictionary of hash values -> word index -> start positions
# (hashed_slices and hashing_fcn are not shown here; hashed_slices is assumed
# to return a dict mapping each hash value to a list of start positions)
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.items():
        # We only care about the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts

# Now check all hashes
for starts_by_word in all_hashed.values():
    if len(starts_by_word) == 1:
        # only the first word has this hash, so its slices are unique to it
        unique.extend(set(words[0][i:i+20] for i in starts_by_word[0]))
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.items():
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.items():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)

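(I tried extending it to handle all three words. It's possible, but the complications would detract from the algorithm.)
Be warned, I have not tested this. There is probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. With too many collisions you won't gain anything; with too few you'll hit memory problems. If you are dealing with just DNA base codes, you can hash each 20-character string to a 40-bit number and still have no collisions, so the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.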
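
The hashed_slices and hashing_fcn names used above are not defined anywhere in this thread. As a rough sketch of what they might look like for DNA data in Python 3, the code below packs each base into 2 bits so that a 20-base slice becomes a 40-bit integer. The bodies, the BASE_BITS table and the name dna_hash are assumptions, and the input is assumed to be lowercase a, c, g and t only.

```python
from collections import defaultdict

BASE_BITS = {"a": 0, "c": 1, "g": 2, "t": 3}  # assumed 2-bit code per base

def dna_hash(piece):
    """Pack a DNA string into an integer, 2 bits per base (20 bases -> 40 bits)."""
    value = 0
    for ch in piece:
        value = (value << 2) | BASE_BITS[ch]
    return value

def hashed_slices(seq, length, hashing_fcn):
    """Map each hash value to the list of start positions of slices with that hash."""
    starts = defaultdict(list)
    for start in range(len(seq) - length + 1):
        starts[hashing_fcn(seq[start:start + length])].append(start)
    return dict(starts)

print(hashed_slices("acgtacg", 4, dna_hash))
# {27: [0], 108: [1], 177: [2], 198: [3]}
```

For slices of one fixed length this encoding maps distinct slices to distinct integers, which is the no-collision case described above; a general-purpose hash would not give that guarantee.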
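
Let's try to improve on this.
First, let's keep sets instead of dictionaries, since they manage uniqueness anyway.
Second, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps only consider the 20-character slices starting with one particular letter at a time. For DNA, this cuts the memory requirement by 75%.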
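seqlen = 20
letters = "acgt"  # assumed alphabet (DNA); adjust to your data
maxlength = max(len(word) for word in words)
for startletter in letters:
    # one pass per start letter, so only the slices beginning with that
    # letter are held in memory at any time
    seqtrie = {}  # slice -> set of ids of the words containing it
    for letterid in range(maxlength - seqlen + 1):
        for wordid, word in enumerate(words):
            if letterid + seqlen <= len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    seqtrie.setdefault(seq, set()).add(wordid)
    # slices recorded for exactly one word id are unique to that word;
    # collect or write them out here before the next pass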
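
Putting the two suggestions together (sets instead of dictionaries, and one start letter per pass), a self-contained Python 3 sketch might look like the following. It restructures the loop rather than copying it, and the function name unique_by_start_letter is hypothetical.

```python
def unique_by_start_letter(words, seqlen, letters):
    """Yield (word index, slice) for slices found in exactly one word,
    processing one start letter at a time to limit peak memory."""
    for startletter in letters:
        # one set per word, holding only slices that begin with startletter
        per_word = [set() for _ in words]
        for wordid, word in enumerate(words):
            start = word.find(startletter)
            # stop once a full-length slice no longer fits in the word
            while start != -1 and start + seqlen <= len(word):
                per_word[wordid].add(word[start:start + seqlen])
                start = word.find(startletter, start + 1)
        for wordid, own in enumerate(per_word):
            others = (s for i, s in enumerate(per_word) if i != wordid)
            for seq in own.difference(*others):
                yield wordid, seq

print(sorted(unique_by_start_letter(["aab", "abb", "abc"], 2, "abc")))
# [(0, 'aa'), (1, 'bb'), (2, 'bc')]
```

Each pass rescans every word, so CPU time grows with the size of the alphabet, which is the trade of CPU for memory described above.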
