Removing redundancy in a loop using Python (python, bioinformatics)

I have a huge input file that looks like this:

contig  protein start end
con1    P1  140 602
con1    P2  140 602
con1    P3  232 548
con2    P4  335 801
con2    P5  642 732
con2    P6  335 779
con2    P7  729 812
con3    P8  17  348
con3    P9  16  348

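I want to remove the homologous or redundant proteins, which I assume are those with the same start and end sites, or with smaller start or end sites, respectively. So my output file would look like this, file.txt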

Here is the script I tried; for some reason it does not satisfy both conditions:

from itertools import groupby
def non_homolog(hits):
    nonhomolog=[]
    overst = False
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        if p[2] <= c[2] and c[3] <= p[3]:
            if not overst: nonhomolog.append(c)
            nonhomolog.append(c)
            overst = True   
    return nonhomolog

fh = open('example.txt')
oh = open('nonhomologs.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    hits.sort(key=lambda x: x[2])
    if non_homolog(hits):
        for hit in hits:
            oh.write('\t'.join([str(f) for f in hit])+'\n')

Try this on for size:
# this code assumes Python 2.7
from itertools import groupby, izip
from operator import attrgetter

INPUT    = "file.txt"
HOMO_YES = "homologs.txt"
HOMO_NO  = "nonhomologs.txt"
MAX_DIFF = 5

class Row:
    __slots__ = ["line", "con", "protein", "start", "end"]

    def __init__(self, s):
        self.line    = s.rstrip()
        data         = s.split()
        self.con     = data[0]
        self.protein = data[1]
        self.start   = int(data[2])
        self.end     = int(data[3])

    def __str__(self):
        return self.line

def count_homologs(items, max_diff=MAX_DIFF):
    num_items  = len(items)
    counts     = [0] * num_items
    # first item
    for i, item_i in enumerate(items):
        max_start = item_i.start + max_diff
        max_end   = item_i.end   + max_diff
        # second item
        for j in xrange(i+1, num_items):
            item_j = items[j]
            if item_j.start > max_start:
                break
            elif item_j.end <= max_end:
                counts[i] += 1
                counts[j] += 1
    return counts

def main():
    with open(INPUT) as inf, open(HOMO_YES, "w") as outhomo, open(HOMO_NO, "w") as outnothomo:
        # skip header
        next(inf, '')
        rows = (Row(line) for line in inf)

        for con, item_iter in groupby(rows, key=attrgetter("con")):
            # per-con list of Rows sorted by start,end
            items = sorted(item_iter, key=attrgetter("start", "end"))
            # get #homologs for each item
            counts = count_homologs(items)
            # do output
            for c,item in izip(counts, items):
                if c:
                    outhomo.write(str(item) + "\n")
                else:
                    outnothomo.write(str(item) + "\n")

if __name__=="__main__":
    main()

==homologs.txt===
con1    P1  140 602
con1    P2  140 602
con3    P9  16  348
con3    P8  17  348

==nonhomologs.txt===
con1    P3  232 548
con2    P6  335 779
con2    P4  335 801
con2    P5  642 732
con2    P7  729 812
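
For readers on Python 3, where `izip` and `xrange` no longer exist, the same counting logic might be sketched as below. This is only a rough port, not the answerer's code: `parse_row` and `split_homologs` are hypothetical helper names, the `Row` class is swapped for a `NamedTuple`, and the input is passed in as a list of lines (header already removed) instead of being read from `file.txt`. Like the answer, it only bounds `end` from above and assumes the lines are already grouped by contig.

```python
from itertools import groupby
from operator import attrgetter
from typing import NamedTuple

MAX_DIFF = 5  # same tolerance as in the answer above


class Row(NamedTuple):
    con: str
    protein: str
    start: int
    end: int


def parse_row(line):
    """Hypothetical helper: turn 'con1 P1 140 602' into a Row."""
    con, protein, start, end = line.split()
    return Row(con, protein, int(start), int(end))


def count_homologs(items, max_diff=MAX_DIFF):
    """Count matching partners per item; items must be sorted by (start, end)."""
    counts = [0] * len(items)
    for i, item_i in enumerate(items):
        max_start = item_i.start + max_diff
        max_end = item_i.end + max_diff
        for j in range(i + 1, len(items)):
            item_j = items[j]
            if item_j.start > max_start:
                break  # sorted by start, so no later item can match either
            if item_j.end <= max_end:  # upper bound only, as in the answer
                counts[i] += 1
                counts[j] += 1
    return counts


def split_homologs(lines, max_diff=MAX_DIFF):
    """Hypothetical helper: return (homologs, nonhomologs) as lists of Row."""
    homologs, nonhomologs = [], []
    rows = (parse_row(line) for line in lines if line.strip())
    for _, group in groupby(rows, key=attrgetter("con")):
        items = sorted(group, key=attrgetter("start", "end"))
        for count, item in zip(count_homologs(items, max_diff), items):
            (homologs if count else nonhomologs).append(item)
    return homologs, nonhomologs


SAMPLE = """\
con1 P1 140 602
con1 P2 140 602
con1 P3 232 548
con2 P4 335 801
con2 P5 642 732
con2 P6 335 779
con2 P7 729 812
con3 P8 17 348
con3 P9 16 348
"""

homologs, nonhomologs = split_homologs(SAMPLE.splitlines())
print([r.protein for r in homologs])     # ['P1', 'P2', 'P9', 'P8']
print([r.protein for r in nonhomologs])  # ['P3', 'P6', 'P4', 'P5', 'P7']
```

Returning two lists instead of writing two files keeps the sketch easy to check; writing them out with one `open(...)` per list would restore the original behaviour.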

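Possible problem: if we have three items, say `con1 p1 140 602`, `con1 p2 144 602`, `con1 p3 148 602`, then p1 is homologous to p2 and p2 is homologous to p3, but p1 is not homologous to p3; how should this be handled?

I'm mainly interested in the ones with 0 difference, i.e. basically the same start site; then, just to compare the results, I would also consider a difference of +-5 units, possibly fewer. So I think the case you mention would count as homologs in the latter case.

@Hugh Bothwell, to be honest I'm a bit worn out from working on this, so for now it is probably best to concentrate on the identical ones. I do have to consider that case, though, because some of my contigs have values within +-5.

Could you round down to the next lowest 5 (801 and 803 to 800, 807 and 808 to 805)? Then you could treat them as identical for comparison, and even use a tuple like (140, 600) as a dict key for easy lookup.

@user3224522: so, in other words, homologs should contain every row that is homologous to anything else (not just pairwise-homologous values)?

Hi, thanks for your answer! I tried to run it, but it gives me an invalid syntax error at the `with open` block.

What version of Python are you using?

That's your problem :-) 2.6 supports `with`, but not chaining: instead of `with a, b, c:` you have to use nested `with a: with b: with c:`.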
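
The rounding idea from the comments above could look roughly like the following. It is only a sketch under assumptions: `round_down` and `group_by_bucket` are hypothetical names, the bucket width of 5 is taken from the +-5 tolerance in the discussion, and the contig is added to the `(start, end)` key so that rows from different contigs never share a bucket.

```python
from collections import defaultdict

BUCKET = 5  # assumed bucket width, from the +-5 tolerance discussed above


def round_down(value, step=BUCKET):
    """Round value down to a multiple of step (803 -> 800, 807 -> 805)."""
    return value - value % step


def group_by_bucket(rows, step=BUCKET):
    """Map (contig, rounded start, rounded end) -> list of protein names."""
    buckets = defaultdict(list)
    for con, protein, start, end in rows:
        key = (con, round_down(start, step), round_down(end, step))
        buckets[key].append(protein)
    return dict(buckets)


# The three-item case raised in the comments.
rows = [
    ("con1", "p1", 140, 602),
    ("con1", "p2", 144, 602),
    ("con1", "p3", 148, 602),
]
groups = group_by_bucket(rows)
for key, proteins in sorted(groups.items()):
    print(key, proteins)
# ('con1', 140, 600) ['p1', 'p2']
# ('con1', 145, 600) ['p3']
```

Note the trade-off this makes visible: p2 (144) and p3 (148) are only 4 apart but land in different buckets, so rounding gives clean, non-overlapping groups at the cost of splitting some near neighbours. With `step=1` the key is the exact `(contig, start, end)`, which matches the "0 difference" case the asker cares about most.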