How can I optimize set matching with Python in this case?
I have a text file with two columns of "scaffolds", like this:
scaffold1|size14662 scaffold1|size14662
scaffold1|size14662 scaffold2|size14565
scaffold1|size14662 scaffold111160|size1478
scaffold2|size14565 scaffold2|size14565
scaffold2|size14565 scaffold1|size14662
scaffold2|size14565 scaffold239623|size320
scaffold3|size14436 scaffold3|size14436
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold149|size9055
scaffold4|size14291 scaffold4|size14291
scaffold4|size14291 scaffold32275|size3028
scaffold4|size14291 scaffold66288|size2175
scaffold5|size13770 scaffold5|size13770
scaffold5|size13770 scaffold133|size9198
scaffold5|size13770 scaffold149|size9055
scaffold6|size13181 scaffold6|size13181
scaffold6|size13181 scaffold92|size9644
scaffold6|size13181 scaffold113496|size1447
scaffold7|size13167 scaffold7|size13167
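The "scaffolds" in the right column "match" (as in "are the same thing as") the corresponding "scaffold" in the left column. For example, [scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478] in the right column are the same as scaffold1|size14662 in the left column.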
From this file I need to get a list (not a Python list, just a listing) containing all the sets of matching scaffolds, like this:
scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291
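I was able to write some code that does this, but it is very slow because it iterates over the same collection again and again. Since I am working with a file of about 2 million lines, this is not a good solution: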
rawscafs = open("columnfile")
scafs = {}
for line in rawscafs:
    cont = 0
    splitvalues = line.split()
    # Scan every group collected so far (this is the slow part)
    for k, v in scafs.items():
        if splitvalues[1] in v:
            cont = 1
        elif splitvalues[0] in v:
            scafs[k].add(splitvalues[1])
            cont = 1
    if cont == 1:
        cont = 0
        continue
    # Neither scaffold was found in an existing group
    if splitvalues[0] in scafs:
        scafs[splitvalues[0]].add(splitvalues[1])
    else:
        scafs[splitvalues[0]] = set()
        scafs[splitvalues[0]].add(splitvalues[1])
rawscafs.close()
for key in scafs:
    for i in scafs[key]:
        print(i + "\n")
    print("---\n")
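As you can see, this is ugly code, but I was only looking for a quick and dirty solution. Clearly I have not found one yet.
Can anyone help me optimize this code (or offer a simpler solution, since I am sure there must be one and I just cannot think of it)?
Thanks to @DSM for the pointer! Using the information provided there, I was able to find a solution to my problem. Here it is: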
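#!/usr/bin/python3

# Collect, for each left-column scaffold, the set of its right-column matches.
# Assumes rows with the same left-column value are adjacent in the file.
infile = open('columnfile', 'r')
title = ""
scaf = set()
scafs = []
for lines in infile:
    lines = lines.split()
    if lines[0] != title:
        # A new left-column scaffold starts: store the previous set
        title = lines[0]
        scafs.append(scaf)
        scaf = set()
        scaf.add(lines[1])
    else:
        scaf.add(lines[1])
scafs.append(scaf)  # store the set of the last scaffold
del scafs[0]        # drop the empty set stored on the first row
infile.close()


def consolidate(sets):
    # Merge every pair of sets that share at least one element
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Move everything into s2 and keep merging from there
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]


for i in consolidate(scafs):
    for a in i:
        print(a)
    print("---")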
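Yes, I know it is still ugly code, but for now it does what I need it to do. Once I plug it into the program, it will certainly look better.
If I understand correctly, this is called [link], and can be viewed as a [link]. There are some implementations [link], and you can see some timing tests on the implementations [link]. I have used the iterative version from the first link myself.
Thanks @DSM! Your pointer led me to the solution!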
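For a file of about 2 million lines, the pairwise comparison inside consolidate may still become the bottleneck, since each set is checked against every later set. Grouping scaffolds that match directly or through a chain of other matches (scaffold239623|size320 matches scaffold2|size14565, which in turn matches scaffold1|size14662) is the same as finding connected components in a graph where every line is an edge. Below is a minimal sketch of that idea using a union-find (disjoint-set) structure. This is a different technique from the one in the answer above, and the names group_scaffolds and read_pairs are made up for illustration. It assumes every useful line has exactly two whitespace-separated fields.

```python
from collections import defaultdict


def group_scaffolds(pairs):
    """Group names linked, directly or transitively, by any (left, right) pair."""
    parent = {}  # maps each name to its parent; a root is its own parent

    def find(x):
        # Register unseen names as their own root
        parent.setdefault(x, x)
        # Walk up to the root of x's tree
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way directly at the root
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Union: attach one tree under the other
            parent[root_right] = root_left

    # Collect the members of each tree under its root
    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return list(groups.values())


def read_pairs(path):
    """Yield (left, right) tuples from a two-column file (hypothetical helper)."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:  # skip blank or malformed lines
                yield fields[0], fields[1]
```

Something like `for group in group_scaffolds(read_pairs("columnfile")): print("\n".join(group)); print("---")` would then print the groups in the format shown in the question. Each line is processed once and no group is rescanned, so the running time should grow roughly linearly with the number of lines. It also does not depend on rows with the same left-column scaffold being adjacent in the file. The order of groups and of names within a group is not guaranteed.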