Python: merging lists that share common elements
My input is a list of lists. Some of them share common elements, for example:
L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
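I need to merge all lists that share a common element, and repeat this procedure until no remaining lists have an item in common. I thought about using boolean operations and a while loop, but could not come up with a good solution.
The final result should be: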
L = [['a','b','c','d','e','f','g','o','p'],['k']]
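Not being sure what you want, I decided to guess what you mean: you just want to find every element once.
#!/usr/bin/python

def clink(l, acc):
    # Walk nested lists recursively and record every leaf element once.
    for sub in l:
        if isinstance(sub, list):
            clink(sub, acc)
        else:
            acc[sub] = 1

def clunk(l):
    acc = {}
    clink(l, acc)
    print(list(acc.keys()))

l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]
clunk(l)
On Python 3.7+ (where dicts keep insertion order) the output looks like this:
['a', 'b', 'c', 'd', 'e', 'k', 'o', 'p', 'f', 'g']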
Algorithm:
l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]

out = []
while len(l) > 0:
    first, *rest = l
    first = set(first)

    # Keep absorbing overlapping lists until `first` stops growing.
    lf = -1
    while len(first) > lf:
        lf = len(first)

        rest2 = []
        for r in rest:
            if len(first.intersection(set(r))) > 0:
                first |= set(r)
            else:
                rest2.append(r)
        rest = rest2

    out.append(first)
    l = rest

print(out)
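I think this can be solved by modelling the problem as a graph. Each sublist is a node, and it shares an edge with another node only if the two sublists have some element in common. A merged sublist is therefore a connected component of the graph. Merging everything is simply a matter of finding all connected components and listing them.
This can be done with a simple traversal of the graph. Both BFS and DFS can be used, but I am using DFS here because it is somewhat shorter for me.
l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
taken = [False] * len(l)
l = [set(elem) for elem in l]

def dfs(node, index):
    taken[index] = True
    ret = node
    for i, item in enumerate(l):
        if not taken[i] and not ret.isdisjoint(item):
            ret.update(dfs(item, i))
    return ret

def merge_all():
    ret = []
    for i, node in enumerate(l):
        if not taken[i]:
            ret.append(list(dfs(node, i)))
    return ret

print(merge_all())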
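The traversal above uses DFS. As a rough illustration of the breadth-first alternative, here is a minimal sketch that is not taken from any of the answers; the names `merge_bfs` and `where` are hypothetical. It indexes each element to the sublists containing it, so a sublist is only enqueued once.

```python
from collections import defaultdict, deque

def merge_bfs(lists):
    """Group elements into connected components with a breadth-first search."""
    # Map every element to the indices of the sublists that contain it.
    where = defaultdict(list)
    for idx, sub in enumerate(lists):
        for item in sub:
            where[item].append(idx)

    visited = [False] * len(lists)
    merged = []
    for start in range(len(lists)):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        component = set()
        while queue:
            idx = queue.popleft()
            for item in lists[idx]:
                if item in component:
                    continue
                component.add(item)
                # Enqueue every unvisited sublist that shares this element.
                for other in where[item]:
                    if not visited[other]:
                        visited[other] = True
                        queue.append(other)
        merged.append(sorted(component))
    return merged

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print(merge_bfs(L))
# prints [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]
```

Because of the element index, each sublist and each element is handled once, rather than rescanning all sublists at every step.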
You can see your list as a notation for a graph, i.e.
['a','b','c']
is a graph with 3 nodes connected to each other. The problem you are trying to solve is finding the connected components of this graph.
You can use NetworkX for this, which has the advantage that it is pretty much guaranteed to be correct:
l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

import networkx
from networkx.algorithms.components.connected import connected_components

def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """
    treat `l` as a Graph and return its edges
    to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current

G = to_graph(l)
print(list(connected_components(G)))
# prints the two components as sets: {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'} and {'k'}
# (the order of components and of elements within them may vary)
To solve this efficiently yourself you have to convert the list into something graph-like anyway, so you might as well use NetworkX from the start.

I ran into the same problem of trying to merge lists with common values. This example may be what you are looking for. It loops over the lists only once and updates the result set as it goes.
lists = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
lists = sorted([sorted(x) for x in lists])  # Sort the lists so you don't miss things. Trust me, needs to be done.

resultlist = []  # Create the empty result list.

if len(lists) >= 1:  # If your list is empty then you don't need to do anything.
    resultlist = [lists[0]]  # Add the first item to your result set.
    if len(lists) > 1:  # If there is only one list in your list then you don't need to do anything.
        for l in lists[1:]:  # Loop through lists starting at list 1.
            listset = set(l)  # Turn your list into a set.
            merged = False  # Trigger
            for index in range(len(resultlist)):  # Use indexes of the list for speed.
                rset = set(resultlist[index])  # Get a list from your result set as a set.
                if len(listset & rset) != 0:  # If listset and rset have a common value, the len will be at least 1.
                    resultlist[index] = list(listset | rset)  # Update the result list with the union of listset and rset.
                    merged = True  # Turn trigger to True.
                    break  # Because you found a match there is no need to continue the for loop.
            if not merged:  # If there was no match, add the list to the result set so it doesn't get left out.
                resultlist.append(l)
print(resultlist)
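My attempt. It has a functional look to it.
#!/usr/bin/python
from collections import defaultdict
from functools import reduce  # reduce is no longer a builtin in Python 3

l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
hashdict = defaultdict(int)

def hashit(x, y):
    # Count how often each element occurs across all sublists.
    for i in y:
        x[i] += 1
    return x

def merge(x, y):
    # A sublist whose elements occur more often than its own length shares something.
    sums = sum([hashdict[i] for i in y])
    if sums > len(y):
        x[0] = x[0].union(y)
    else:
        x[1] = x[1].union(y)
    return x

hashdict = reduce(hashit, l, hashdict)
sets = reduce(merge, l, [set(), set()])
print([list(sets[0]), list(sets[1])])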
You are looking for connected components in a graph. Here is how you could implement it without using a graph library:
from collections import defaultdict

def connected_components(lists):
    neighbors = defaultdict(set)
    seen = set()
    for each in lists:
        for item in each:
            neighbors[item].update(each)

    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        nodes = set([node])
        next_node = nodes.pop
        while nodes:
            node = next_node()
            see(node)
            nodes |= neighbors[node] - seen
            yield node

    for node in neighbors:
        if node not in seen:
            yield sorted(component(node))

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print(list(connected_components(L)))
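Another library-free way to find the same components is a disjoint-set (union-find) structure. This is a different technique from the traversal above, shown only as a minimal sketch; `merge_union_find`, `find` and `union` are hypothetical names, and elements are assumed to be hashable and sortable.

```python
def merge_union_find(lists):
    """Merge sublists that share elements using a disjoint-set forest."""
    parent = {}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for sub in lists:
        for item in sub:
            parent.setdefault(item, item)  # register unseen elements
        for item in sub[1:]:
            union(sub[0], item)            # tie the whole sublist together

    # Collect the elements under their root representative.
    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return [sorted(group) for group in groups.values()]

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print(merge_union_find(L))
```

Each sublist is processed exactly once, so no repeated passes over the input are needed to reach a fully merged result.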
This may be a simpler/faster algorithm, and it seems to work well:
l = [['a', 'b', 'c'], ['b', 'd', 'e'], ['k'], ['o', 'p'], ['e', 'f'], ['p', 'a'], ['d', 'g']]

len_l = len(l)
i = 0
while i < (len_l - 1):
    for j in range(i + 1, len_l):

        # i,j iterate over all pairs of l's elements including new
        # elements from merged pairs. We use len_l because len(l)
        # may change as we iterate
        i_set = set(l[i])
        j_set = set(l[j])

        if len(i_set.intersection(j_set)) > 0:
            # Remove these two from list
            l.pop(j)
            l.pop(i)

            # Merge them and append to the orig. list
            ij_union = list(i_set.union(j_set))
            l.append(ij_union)

            # len(l) has changed
            len_l -= 1

            # adjust 'i' because elements shifted
            i -= 1

            # abort inner loop, continue with next l[i]
            break

    i += 1

print(l)
# prints [['k'], [...]] where the second list holds 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'
# (the order inside the merged list may vary)
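I found itertools to be a fast option for merging lists, and it solved this problem for me:
import itertools

# L is the list of lists from the question
L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

LL = set(itertools.chain.from_iterable(L))
# LL is {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'k', 'o', 'p'}

for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
    L += [list(set(itertools.chain.from_iterable(components)))]

# then L holds two lists: ['k'] and the nine elements 'a' to 'g', 'o', 'p' (order may vary)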
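One possible refinement of this approach for large inputs is to visit the elements from most to least frequent, so that big merges happen early. The sketch below is only an illustration of that ordering with `collections.Counter`; the function name `merge_by_frequency` is hypothetical, and whether it is actually faster depends on the data.

```python
from collections import Counter
from itertools import chain

def merge_by_frequency(lists):
    """Merge overlapping sublists, visiting the most common elements first."""
    lists = [list(sub) for sub in lists]  # work on a copy, leave the input alone
    counts = Counter(chain.from_iterable(lists))
    for each, _ in counts.most_common():
        # Pull out every current group containing this element ...
        components = [x for x in lists if each in x]
        for sub in components:
            lists.remove(sub)
        # ... and put back their union as a single group.
        lists.append(list(set(chain.from_iterable(components))))
    return lists

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print(merge_by_frequency(L))
```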
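For large sets, sorting LL by frequency, from the most common elements to the least common, can speed things up a bit.

I needed to perform the clustering technique described by the OP millions of times on rather large lists, so I wanted to determine which of the methods suggested above is both the most accurate and the most performant.
For each method above I ran 10 trials on input lists sized from 2^1 through 2^10, using the same input list for every method, and measured the average runtime of each algorithm in milliseconds. Here are the results: [runtime chart not reproduced here]
These results helped me see that, of the methods that consistently return correct results, @jochen's is the fastest. Among the methods that do not consistently return correct results, mak's solution often does not include all input elements (i.e. members of the list of lists are missing), and the solutions of braaksma, cmangla, and asterisk are not guaranteed to be maximally merged.
It is interesting that the two fastest correct algorithms have the top two numbers of upvotes to date, in properly ranked order.
Here is the code used to run the tests: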
from networkx.algorithms.components.connected import connected_components
from itertools import chain
from random import randint, random
from collections import defaultdict, deque
from copy import deepcopy
from functools import reduce  # needed by stiltskin() on Python 3
from multiprocessing import Pool
import networkx
import datetime
import os

##
# @mimomu
##

def mimomu(l):
    l = deepcopy(l)
    s = set(chain.from_iterable(l))
    for i in s:
        components = [x for x in l if i in x]
        for j in components:
            l.remove(j)
        l += [list(set(chain.from_iterable(components)))]
    return l

##
# @Howard
##

def howard(l):
    out = []
    while len(l) > 0:
        first, *rest = l
        first = set(first)

        lf = -1
        while len(first) > lf:
            lf = len(first)

            rest2 = []
            for r in rest:
                if len(first.intersection(set(r))) > 0:
                    first |= set(r)
                else:
                    rest2.append(r)
            rest = rest2

        out.append(first)
        l = rest
    return out

##
# Nx @Jochen Ritzel
##

def jochen(l):
    l = deepcopy(l)

    def to_graph(l):
        G = networkx.Graph()
        for part in l:
            # each sublist is a bunch of nodes
            G.add_nodes_from(part)
            # it also implies a number of edges:
            G.add_edges_from(to_edges(part))
        return G

    def to_edges(l):
        """
        treat `l` as a Graph and return its edges
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
        """
        it = iter(l)
        last = next(it)

        for current in it:
            yield last, current
            last = current

    G = to_graph(l)
    return list(connected_components(G))

##
# Merge all @MAK
##

def mak(l):
    l = deepcopy(l)
    taken = [False] * len(l)
    # NOTE: on Python 3 `map` returns a one-shot iterator, whereas the
    # standalone answer builds a list of sets with a list comprehension.
    l = map(set, l)

    def dfs(node, index):
        taken[index] = True
        ret = node
        for i, item in enumerate(l):
            if not taken[i] and not ret.isdisjoint(item):
                ret.update(dfs(item, i))
        return ret

    def merge_all():
        ret = []
        for i, node in enumerate(l):
            if not taken[i]:
                ret.append(list(dfs(node, i)))
        return ret

    result = list(merge_all())
    return result

##
# @cmangla
##

def cmangla(l):
    l = deepcopy(l)
    len_l = len(l)
    i = 0
    while i < (len_l - 1):
        for j in range(i + 1, len_l):

            # i,j iterate over all pairs of l's elements including new
            # elements from merged pairs. We use len_l because len(l)
            # may change as we iterate
            i_set = set(l[i])
            j_set = set(l[j])

            if len(i_set.intersection(j_set)) > 0:
                # Remove these two from list
                l.pop(j)
                l.pop(i)

                # Merge them and append to the orig. list
                ij_union = list(i_set.union(j_set))
                l.append(ij_union)

                # len(l) has changed
                len_l -= 1

                # adjust 'i' because elements shifted
                i -= 1

                # abort inner loop, continue with next l[i]
                break

        i += 1
    return l

##
# @pillmuncher
##

def pillmuncher(l):
    l = deepcopy(l)

    def connected_components(lists):
        neighbors = defaultdict(set)
        seen = set()
        for each in lists:
            for item in each:
                neighbors[item].update(each)

        def component(node, neighbors=neighbors, seen=seen, see=seen.add):
            nodes = set([node])
            next_node = nodes.pop
            while nodes:
                node = next_node()
                see(node)
                nodes |= neighbors[node] - seen
                yield node

        for node in neighbors:
            if node not in seen:
                yield sorted(component(node))

    return list(connected_components(l))

##
# @NicholasBraaksma
##

def braaksma(l):
    l = deepcopy(l)
    lists = sorted([sorted(x) for x in l])  # Sort the lists so you don't miss things. Trust me, needs to be done.

    resultlist = []  # Create the empty result list.

    if len(lists) >= 1:  # If your list is empty then you don't need to do anything.
        resultlist = [lists[0]]  # Add the first item to your result set.
        if len(lists) > 1:  # If there is only one list in your list then you don't need to do anything.
            for l in lists[1:]:  # Loop through lists starting at list 1.
                listset = set(l)  # Turn your list into a set.
                merged = False  # Trigger
                for index in range(len(resultlist)):  # Use indexes of the list for speed.
                    rset = set(resultlist[index])  # Get a list from your result set as a set.
                    if len(listset & rset) != 0:  # If listset and rset have a common value, the len will be at least 1.
                        resultlist[index] = list(listset | rset)  # Update the result list with the union of listset and rset.
                        merged = True  # Turn trigger to True.
                        break  # Because you found a match there is no need to continue the for loop.
                if not merged:  # If there was no match, add the list to the result set so it doesn't get left out.
                    resultlist.append(l)
    return resultlist

##
# @Rumple Stiltskin
##

def stiltskin(l):
    l = deepcopy(l)
    hashdict = defaultdict(int)

    def hashit(x, y):
        for i in y:
            x[i] += 1
        return x

    def merge(x, y):
        sums = sum([hashdict[i] for i in y])
        if sums > len(y):
            x[0] = x[0].union(y)
        else:
            x[1] = x[1].union(y)
        return x

    hashdict = reduce(hashit, l, hashdict)
    sets = reduce(merge, l, [set(), set()])
    return list(sets)

##
# @Asterisk
##

def asterisk(l):
    l = deepcopy(l)
    results = {}
    for sm in ['min', 'max']:
        sort_method = min if sm == 'min' else max
        l = sorted(l, key=lambda x: sort_method(x))
        queue = deque(l)
        grouped = []
        while len(queue) >= 2:
            l1 = queue.popleft()
            l2 = queue.popleft()
            s1 = set(l1)
            s2 = set(l2)
            if s1 & s2:
                queue.appendleft(s1 | s2)
            else:
                grouped.append(s1)
                queue.appendleft(s2)
        if queue:
            grouped.append(queue.pop())
        results[sm] = grouped
    if len(results['min']) < len(results['max']):
        return results['min']
    return results['max']

##
# Validate no more clusters can be merged
##

def validate(output, L):
    # validate all sublists are maximally merged
    d = defaultdict(list)
    for idx, i in enumerate(output):
        for j in i:
            d[j].append(i)
    if any([len(i) > 1 for i in d.values()]):
        return 'not maximally merged'
    # validate all items in L are accounted for
    all_items = set(chain.from_iterable(L))
    accounted_items = set(chain.from_iterable(output))
    if all_items != accounted_items:
        return 'missing items'
    # validate results are good
    return 'true'

##
# Timers
##

def time(func, L):
    start = datetime.datetime.now()
    result = func(L)
    delta = datetime.datetime.now() - start
    return result, delta

##
# Function runner
##

def run_func(args):
    func, L, input_size = args
    results, elapsed = time(func, L)
    validation_result = validate(results, L)
    return func.__name__, input_size, elapsed, validation_result

##
# Main
##

all_results = defaultdict(lambda: defaultdict(list))
funcs = [mimomu, howard, jochen, mak, cmangla, braaksma, asterisk]
args = []

for trial in range(10):
    for s in range(10):
        input_size = 2**s

        # get some random inputs to use for all trials at this size
        L = []
        for i in range(input_size):
            sublist = []
            for j in range(randint(5, 10)):
                sublist.append(randint(0, 2**24))
            L.append(sublist)
        for i in funcs:
            args.append([i, L, input_size])

pool = Pool()
for result in pool.imap(run_func, args):
    func_name, input_size, elapsed, validation_result = result
    all_results[func_name][input_size].append({
        'time': elapsed,
        'validation': validation_result,
    })
    # show the running time for the function at this input size
    print(input_size, func_name, elapsed, validation_result)
pool.close()
pool.join()

# write the average of time trials at each size for each function
with open('times.tsv', 'w') as out:
    for func in all_results:
        validations = [i['validation'] for j in all_results[func] for i in all_results[func][j]]
        linetype = 'incorrect results' if any([i != 'true' for i in validations]) else 'correct results'

        for input_size in all_results[func]:
            all_times = [i['time'].microseconds for i in all_results[func][input_size]]
            avg_time = sum(all_times) / len(all_times)

            out.write(func + '\t' + str(input_size) + '\t' + \
                str(avg_time) + '\t' + linetype + '\n')
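A note for anyone reusing the timing part of this harness: `timedelta.microseconds` is only the sub-second microsecond field (0 to 999999), so a run lasting a second or longer would be under-reported, and the value is in microseconds rather than milliseconds. As a hedged alternative, here is a minimal sketch built on `time.perf_counter`; the helper name `time_ms` and the `repeats` parameter are hypothetical and not part of the benchmark above.

```python
import time

def time_ms(func, *args, repeats=5):
    """Return (result, best wall-clock time in milliseconds) over several runs."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return result, best

# Usage with any callable; here the built-in sorted() stands in for a merge function.
result, ms = time_ms(sorted, [3, 1, 2])
print(result, round(ms, 3))
```

Taking the best of several runs is one common way to reduce noise from other processes; averaging, as the benchmark does, is an equally valid choice.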
And here is the code for plotting (R):
library(ggplot2)
df <- read.table('times.tsv', sep='\t')

p <- ggplot(df, aes(x=V2, y=V3, color=as.factor(V1))) +
  geom_line() +
  xlab('number of input lists') +
  ylab('runtime (ms)') +
  labs(color='') +
  scale_x_continuous(trans='log10') +
  facet_wrap(~V4, ncol=1)

ggsave('runtimes.png')
def merge_overlapping_sublists(lst):
    output, refs = {}, {}
    for index, sublist in enumerate(lst):
        output[index] = set(sublist)
        for elem in sublist:
            refs[elem] = index
    changes = True
    while changes:
        changes = False
        for ref_num, sublist in list(output.items()):
            for elem in sublist:
                current_ref_num = refs[elem]
                if current_ref_num != ref_num:
                    # elem belongs to another group: fold this group into it
                    changes = True
                    output[current_ref_num] |= sublist
                    for elem2 in sublist:
                        refs[elem2] = current_ref_num
                    output.pop(ref_num)
                    break
    return list(output.values())

def compare(a, b):
    # Order-insensitive comparison of two lists of sets (consumes b).
    a = list(a)
    try:
        for elem in a:
            b.remove(elem)
    except ValueError:
        return False
    return not b

import random
lst = [["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c", "d", "e"}])
lst = [["a", "b"], ["b", "c"], ["f", "d"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c",}, {"d", "e", "f"}])
lst = [["a", "b"], ["k", "c"], ["f", "g"], ["d", "e"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b"}, {"k", "c"}, {"f", "g"}, {"d", "e"}])
lst = [["a", "b", "c"], ["b", "d", "e"], ["k"], ["o", "p"], ["e", "f"], ["p", "a"], ["d", "g"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"k"}, {"a", "c", "b", "e", "d", "g", "f", "o", "p"}])
lst = [["a", "b"], ["b", "c"], ["a"], ["a"], ["b"]]
random.shuffle(lst)
assert compare(merge_overlapping_sublists(lst), [{"a", "b", "c"}])
# your list
l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]

# import itertools
from itertools import product, groupby

# inner lists to sets (to list of sets)
l = [set(x) for x in l]

# cartesian product merging elements if some element in common
for a, b in product(l, l):
    if a.intersection(b):
        a.update(b)
        b.update(a)

# back to list of lists
l = sorted([sorted(list(x)) for x in l])

# remove dups
list(l for l, _ in groupby(l))

# result
# [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]
import networkx as nx

L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
G = nx.Graph()

# Add nodes to Graph
G.add_nodes_from(sum(L, []))

# Create edges from list of nodes
q = [[(s[i], s[i+1]) for i in range(len(s)-1)] for s in L]

for i in q:
    # Add edges to Graph
    G.add_edges_from(i)

# Find all connected components in graph and list nodes for each component
[list(i) for i in nx.connected_components(G)]
# e.g. [['p', 'c', 'f', 'g', 'o', 'a', 'd', 'b', 'e'], ['k']] (element order may vary)
def cluser_combine(groups):
    n_groups = len(groups)

    # First, put all elements that appear in 'groups' into 'elements'.
    elements = list(set.union(*[set(g) for g in groups]))
    # Then sort the elements.
    elements.sort()
    n_elements = len(elements)

    # I create a list called labels; this is the key to this algorithm.
    # I was inspired by the sklearn KMeans implementation,
    # which has an attribute called labels_:
    # https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
    # I called this algorithm cluster combine because of this inspiration.
    labels = list(range(n_elements))

    # For each group, I get the 'indices' of its members in 'elements',
    # then the labels for those indices, and take the min of the labels
    # as the new label for all of them.
    # Every element whose label is in labels_for_group gets the new label.
    # In other words, each iteration combines two or more existing groups:
    # if the group has labels 0 and 2, the new label is 0 (the min of the two),
    # and both are replaced with 0.
    for i in range(n_groups):
        # if there is only zero/one element in the group, skip
        if len(groups[i]) <= 1:
            continue

        indices = list(map(elements.index, groups[i]))
        labels_for_group = list(set([labels[i] for i in indices]))

        # if there is only one label, all elements in the group already share it, skip
        if len(labels_for_group) == 1:
            continue

        labels_for_group.sort()
        label = labels_for_group[0]

        # combine
        for k in range(n_elements):
            if labels[k] in labels_for_group[1:]:
                labels[k] = label

    new_groups = []
    for label in set(labels):
        new_group = [elements[i] for i, v in enumerate(labels) if v == label]
        new_groups.append(new_group)
    return new_groups

cluser_combine([['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']])
orig = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g'], ['k'],['k'],['k']]

def merge_lists(orig):
    def step(orig):
        mid = []
        mid.append(orig[0])
        for i in range(len(mid)):
            for j in range(1, len(orig)):
                for k in orig[j]:
                    if k in mid[i]:
                        mid[i].extend(orig[j])
                        break
                    elif k == orig[j][-1] and orig[j] not in mid:
                        mid.append(orig[j])
        mid = [sorted(list(set(x))) for x in mid]
        return mid

    # Repeat the merge step until the result no longer changes.
    result = step(orig)
    while result != step(result):
        result = step(result)
    return result

merge_lists(orig)
# [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'], ['k']]