Python - overlapping ranges - determining unique positions


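I have a large dataset in which each value consists of three parts, [chromosome, start, end]. What is the fastest way to compute all unique positions per chromosome, given that I have many overlapping ranges?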

For example:
['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 'chr1:54:90', 'chr1:120:180', 'chr3:50:90']

should result in:
['chr1:5:90', 'chr1:120:180', 'chr3:50:90']

I don't know whether there is a simple way to compute this, but I thought it was worth asking here. Below is a subset of my data.

Thanks in advance,

[['chr9:95149330:95149362', 'chr9:95149330:95149362', 'chr17:70386266:70386304', 'chr17:70386256:70386304', 'chr2:44672786:44672833', 'chr2:44672785:44672833', 'chr2:141966446:141966479', 'chr2:141966446:141966488', 'chr19:18126909:18126938', 'chr19:18126909:18127027', 'chr3:145082003:145082051', 'chr3:145082014:145082121', 'chr6:38835529:38835560', 'chr6:38835529:38835560', 'chr4:120372932:120372986', 'chr4:120372932:120372994', 'chr2:141014019:141014057', 'chr2:141014014:141014057', 'chr18:3445722:3445761', 'chr18:3445722:3445793', 'chr17:72329982:72330015', 'chr17:72329982:72330015', 'chr5:169911920:169911962', 'chr5:169911917:169911962', 'chr4:146482176:146482219', 'chr4:146482176:146482219', 'chr9:104285900:104285935', 'chr9:104285879:104285935', 'chr12:32941976:32942016', 'chr12:32941976:32942028', 'chrX:127923156:127923189', 'chrX:127923156:127923189', 'chr2:9535703:9535755', 'chr2:9535701:9535755', 'chr8:86476618:86476684', 'chr8:86476554:86476642', 'chr9:135756650:135756696', 'chr9:135756650:135756706', 'chr6:103004873:103004932', 'chr6:103004861:103004918', 'chr8:86476618:86476684', 'chr8:86476556:86476648', 'chr1:52280846:52280876', 'chr1:52280845:52280876', 'chr8:86476635:86476685', 'chr8:86476553:86476645', 'chr5:116046573:116046620', 'chr5:116046564:116046615', 'chrX:68039214:68039252', 'chrX:68039214:68039252', 'chr4:181491919:181491953', 'chr4:181491919:181491960', 'chr18:68050122:68050166', 'chr18:68050122:68050166', 'chr2:233985816:233985860', 'chr2:233985808:233985860', 'chr6:17020712:17020750', 'chr6:17020712:17020759', 'chr7:21950625:21950666', 'chr7:21950625:21950666', 'chr12:93292486:93292536', 'chr12:93292481:93292537', 'chr1:246515439:246515472', 'chr1:246515440:246515486', 'chr12:57084093:57084130', 'chr12:57084093:57084134', 'chr1:174801431:174801474', 'chr1:174801431:174801485', 'chr7:92499684:92499734', 'chr7:92499924:92499960', 'chr17:40328527:40328560', 'chr17:40328518:40328560', 'chr8:42944072:42944110', 'chr8:42944073:42944120', 
'chr17:29890450:29890499']

I would do this in three steps:

  • Split the ranges up by chromosome;
  • Extract the contiguous ranges; and
  • Assemble the output as required ("chr:start:end").

Step one:

    from collections import defaultdict
    
    processed = defaultdict(list)
    
    for s in data:
        chr_, pos = s.split(":", 1)
        processed[chr_].append(list(map(int, pos.split(":"))))
    
    For the example input from the question, this gives

    processed == defaultdict(<class 'list'>, 
                             {'chr3': [[50, 80], [50, 90]], 
                              'chr1': [[10, 60], [5, 70], [54, 90], [120, 180]]})
    
    The end result is:

    final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']
    

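    I agree with jonrsharpe's general approach, but I think there is a more elegant way to do it.

    First, we collect the ranges for each chromosome (almost identical to jonrsharpe's version, although I prefer tuples to lists for the ranges).

    Now we can simplify the merge by sorting each chromosome's list by range start. This gives us a useful property: if the current range does not overlap the range just before it, then every merge we have already made on the earlier values is final, and we never have to go back to them.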

    for vals in processed.values():
        vals.sort()
        current = 1
        while current < len(vals):
            # Use >= instead of > if ranges that merely share an endpoint should also be merged.
            if vals[current-1][1] > vals[current][0]:
                # Current and previous ranges overlap, so merge them. Take the larger
                # end, since the previous range may fully contain the current one.
                vals[current-1:current+1] = [
                    (vals[current-1][0], max(vals[current-1][1], vals[current][1]))
                ]
                # The list is now one element shorter, so current already
                # points at the next range to examine.
            else:
                current += 1  # No merge happened, so move on to the next range.
    

    This also gives
    final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']
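The sort-then-merge idea can also be written as a single pass that builds a new list per chromosome rather than assigning to slices in place. The following is only a minimal sketch under a few assumptions: the function name `merge_overlaps` is made up for illustration, the endpoints are treated as inclusive (so ranges that share an endpoint are merged), and the output keeps the order in which chromosomes first appear in the input.

```python
from collections import defaultdict


def merge_overlaps(data):
    """Merge overlapping 'chr:start:end' strings per chromosome (illustrative sketch)."""
    by_chr = defaultdict(list)
    for item in data:
        chr_, start, end = item.split(":")
        by_chr[chr_].append((int(start), int(end)))

    result = []
    for chr_, ranges in by_chr.items():
        ranges.sort()  # sort by start, then by end
        merged = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Overlaps (or touches) the last merged range: extend it if needed.
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        result.extend(f"{chr_}:{a}:{b}" for a, b in merged)
    return result


example = ['chr1:10:60', 'chr1:5:70', 'chr3:50:80',
           'chr1:54:90', 'chr1:120:180', 'chr3:50:90']
print(merge_overlaps(example))
# ['chr1:5:90', 'chr1:120:180', 'chr3:50:90']
```

Sorting dominates the cost, so this runs in O(n log n) per chromosome; appending to a new list avoids the element shifting that repeated slice assignment can cause on long lists.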

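    Hmmm, it's not clear to me how you get from the example input data to the output. Could you describe the algorithm in more detail? Some code (even if it is very inefficient) would also help me understand.

    @SAMMISSMAN, you should read the numbers as the lower bound and the inclusive upper bound of an integer range; the OP wants the non-redundant range limits per chromosome (chr).

    One way to do it: as you go through the list, keep a dictionary keyed by chromosome with a length-2 list recording the minimum and maximum, then iterate over the dictionary and reformat.

    Aren't you missing the two cases where the previous range is encapsulated by the current one, and vice versa? Also, what if you update an existing range and the updated range now covers ranges that were not covered before... does that really matter?

    @deinonychusaur I have fixed the case where the previous range is encapsulated by the current one, but you are right that, for example,
    [(0,10),(15,25),(5,20)]
    would give
    [(0,20),(15,25)]
    , which is not the desired result... I'll think about it.

    I think it is actually easier/faster to check for when they do not overlap, and when two ranges are merged they have to be checked again. So for speed I would merge first while adding, then iterate until no more merges happen. That seems the most sensible approach.

    @deinonychusaur I think that's it now, thanks for your suggestions. Sorting certainly makes things neater.

    @Coryza: if you really want speed, I wouldn't use Python. :-) I expect this to be about as fast as it gets in Python, although more could be gained by merging in parallel.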
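The counterexample from the comments can be checked directly. The sketch below is an assumption-laden illustration, not the code the commenters were discussing: `merge_on_insert` is a hypothetical reconstruction of "merge while adding" that extends the first stored range a new range overlaps, and `merge_sorted` is a sorted single-pass merge. Both treat endpoints as inclusive.

```python
def merge_on_insert(ranges):
    """Hypothetical insert-time merge: extend the first stored range that overlaps."""
    stored = []
    for start, end in ranges:
        for i, (s, e) in enumerate(stored):
            if start <= e and s <= end:
                stored[i] = (min(s, start), max(e, end))
                break  # stop after the first overlap; later overlaps are never revisited
        else:
            stored.append((start, end))
    return stored


def merge_sorted(ranges):
    """Sort by start, then merge each range into the last merged one if they overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


ranges = [(0, 10), (15, 25), (5, 20)]
print(merge_on_insert(ranges))  # [(0, 20), (15, 25)]
print(merge_sorted(ranges))     # [(0, 25)]
```

Without sorting, the extended range (0, 20) is never compared against (15, 25) again, which is why a single unsorted pass would have to be repeated until nothing changes.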
    After the overlapping ranges have been merged, processed holds

    processed == defaultdict(<class 'list'>, 
                             {'chr3': [[50, 90]], 
                              'chr1': [[5, 90], [120, 180]]})
    
    and the output is assembled with

    final = []
    for key, vals in processed.items():
        for start, end in vals:
            final.append(":".join(map(str, (key, start, end))))
    
    which gives

    final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']
    
    Putting it all together with tuples and sorting:

    from collections import defaultdict
    
    processed = defaultdict(list)
    
    for s in data:
        chr_, start, end = s.split(":")
        processed[chr_].append((int(start), int(end)))
    
    for vals in processed.values():
        vals.sort()
        current = 1
        while current < len(vals):
            # Use >= instead of > if ranges that merely share an endpoint should also be merged.
            if vals[current-1][1] > vals[current][0]:
                # Current and previous ranges overlap, so merge them. Take the larger
                # end, since the previous range may fully contain the current one.
                vals[current-1:current+1] = [
                    (vals[current-1][0], max(vals[current-1][1], vals[current][1]))
                ]
                # The list is now one element shorter, so current already
                # points at the next range to examine.
            else:
                current += 1  # No merge happened, so move on to the next range.
    
    final = []
    for key, vals in processed.items():
        for start, end in vals:
            final.append("%s:%s:%s" % (key, str(start), str(end)))