How can I optimize set matching in this case using Python?

I have a text file with two columns of "scaffolds", like this:

scaffold1|size14662     scaffold1|size14662    
scaffold1|size14662     scaffold2|size14565    
scaffold1|size14662     scaffold111160|size1478
scaffold2|size14565     scaffold2|size14565    
scaffold2|size14565     scaffold1|size14662    
scaffold2|size14565     scaffold239623|size320 
scaffold3|size14436     scaffold3|size14436    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold149|size9055   
scaffold4|size14291     scaffold4|size14291    
scaffold4|size14291     scaffold32275|size3028 
scaffold4|size14291     scaffold66288|size2175 
scaffold5|size13770     scaffold5|size13770    
scaffold5|size13770     scaffold133|size9198   
scaffold5|size13770     scaffold149|size9055   
scaffold6|size13181     scaffold6|size13181    
scaffold6|size13181     scaffold92|size9644    
scaffold6|size13181     scaffold113496|size1447
scaffold7|size13167     scaffold7|size13167    
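The "scaffolds" in the right column are "matches" (as in "are the same thing") of the corresponding "scaffold" in the left column. For example, [scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478] in the right column are all the same as scaffold1|size14662 in the left column.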

From this file I need to get a list (not a Python list, just a list) of all the sets of matching scaffolds, like this:

scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291
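I was able to write some code that does this, but it is very slow because it iterates over the same collection again and again. Since I am working with a file of about 2 million lines, this is not a good solution.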

rawscafs = open ("columnfile")

scafs={}
for line in rawscafs:
    cont = 0
    splitvalues=line.split()
    for k,v in scafs.items():
        if splitvalues[1] in v:
            cont = 1
        elif splitvalues[0] in v:
            scafs[k].add(splitvalues[1])
            cont = 1
    if cont == 1:
        cont = 0
        continue       
    if splitvalues[0] in scafs:
        scafs[splitvalues[0]].add(splitvalues[1])
    else:
        scafs[splitvalues[0]] = set()
        scafs[splitvalues[0]].add(splitvalues[1])
rawscafs.close()


for key in scafs:
    for i in (scafs[key]):
        print(i+"\n")
    print("---\n")

rawscafs.close()
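As you can see, this is ugly code, but I was only looking for a quick and dirty solution, and I clearly have not found one yet.
Can someone help me optimize this code (or offer a simpler solution, since I am sure there must be one and I just cannot think of it)?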

Thanks to @DSM for the pointer! Using the information provided there, I was able to find a solution to the problem. Here it is:

#!/usr/bin/python3

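title = ""
scaf = set()
scafs = []
# Group the right-column entries by consecutive left-column value.
with open('columnfile', 'r') as infile:
    for line in infile:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] != title:
            title = fields[0]
            scafs.append(scaf)
            scaf = set()
        scaf.add(fields[1])

scafs.append(scaf)  # keep the final group (appending scafs itself would lose it)
del scafs[0]        # drop the empty placeholder set appended on the first line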

def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]


for i in consolidate(scafs):
    for a in i:
        print(a)
    print("---")

Yes, I know it is still ugly code, but for now it does what I need it to do. Once I plug it into the program, it will certainly look better.

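If I understand correctly, this is known as set consolidation, and it can be viewed as a connected-components problem. Several implementations exist, along with timing tests for some of them.

I used the iterative version from the first link myself. Thanks, @DSM! Your pointer led me to the solution!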
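
One way to avoid the repeated pairwise set comparisons is to treat every line of the file as an edge between two scaffold names and merge groups with a union-find (disjoint-set) structure. The sketch below is an alternative to the iterative consolidation used above, not the method from the linked implementations. The names `read_pairs` and `group_scaffolds` are made up for illustration, and the input is assumed to be two whitespace-separated columns per line.

```python
from collections import defaultdict


def read_pairs(path):
    """Yield (left, right) column pairs from a whitespace-separated file."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or malformed lines
                yield fields[0], fields[1]


def group_scaffolds(pairs):
    """Group names that are connected by any chain of (left, right) pairs."""
    parent = {}

    def find(x):
        # Register x as its own group the first time it is seen.
        parent.setdefault(x, x)
        # Walk up to the root of x's group.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            next_x = parent[x]
            parent[x] = root
            x = next_x
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right  # merge the two groups

    # Collect every name under the root of its group.
    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return list(groups.values())


def print_groups(groups):
    """Print each group in the '---'-separated layout used in the question."""
    for group in groups:
        for name in sorted(group):
            print(name)
        print("---")
```

With the file from the question, a call such as `print_groups(group_scaffolds(read_pairs("columnfile")))` would read each line once and merge groups as it goes, so the cost grows roughly linearly with the number of lines instead of quadratically with the number of sets. Unlike the grouping code above, this version adds both columns to the groups and does not depend on the left column being sorted.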