Python set cover or hitting set; Numpy, minimal combination of elements to form the complete set

python, algorithm, numpy, set, combinations

My goal is to find the smallest possible number of the sub sets [a-f] that together make up the full set A:

A = set([1,2,3,4,5,6,7,8,9,10]) # full set


#--- below are sub sets of A ---

a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])
In reality, the parent set A I am dealing with contains 15k unique elements, and there are 30k sub sets, ranging in length from a single unique element to 1.5k unique elements.

The code I am using so far looks more or less like the following, and it is painfully slow:

import random


B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = B.keys()
random.shuffle(Bx)

Dict = {}

for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)

    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:
                L = Lx
                z.append(ii)

    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]

print Dict[min(Dict.keys())]
This just gives an idea of the approach I have taken. For clarity I have left out some logic that minimizes iterating over sets that are already too large, and other things like that.


I imagine Numpy would be really good at this type of problem, but I cannot think of a way to use it.

The question asks for an implementation; there is no fast algorithm that finds the optimal solution. However, the greedy solution to the problem (repeatedly picking the subset that contains the most elements not yet covered) does a good job in a reasonable amount of time.
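For illustration, here is a minimal sketch of that greedy rule (my own, not the optimized version below), applied to the example sets from the question; it assumes the candidates can in fact cover the full set:

def greedy_cover(full_set, candidates):
    covered = set()
    chosen = []
    while covered < full_set:
        # pick whichever candidate contributes the most uncovered elements
        best = max(candidates, key=lambda s: len(s - covered))
        chosen.append(best)
        covered |= best
    return chosen

print(greedy_cover(A, [a, b, c, d, e, f]))
# -> the sets c, f and d: three subsets suffice for the example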

Edited to add: @Aaron Hall's greedy set cover routine can be improved by using the drop-in replacement below. In Aaron's code, we compute a score len(s - result_set) for every remaining subset, each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if, in the current iteration, the best set picked so far scores higher than some remaining subset did in an earlier iteration, we know that subset's score cannot have improved, and it cannot beat the current best. This suggests keeping the subsets to process in a priority queue; in Python, we can implement the idea with heapq:

# at top of file
import heapq
#... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) just provides a unique number for each subset,
        # used to tie-break equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the resultset
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
Here are the timings for the code above; it is a little more than 3 times faster:

INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
Note: the two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; it comes down to the "lucky" choice of subset when scores are tied.

The priority queue/heap is a well-known optimization of the greedy algorithm, although I could not find a suitable discussion of it to link to.


While the greedy algorithm is a fast way to get an approximate answer, you can spend time afterwards improving on that answer, knowing that it gives an upper bound on the minimal set cover. Techniques for doing so include simulated annealing and branch-and-bound algorithms. Here is a solution that uses itertools.combinations to iterate over various combinations of the subsets, and union(*x) to combine them:

import itertools
subsets = [a,b,c,d,e,f]
def foo(A, subsets):
    found = []
    for n in range(2,len(subsets)):
        for x in itertools.combinations(subsets, n):
            u =  set().union(*x)
            if A==u:
                found.append(x)
        if found:
            break
    return found
print foo(A,subsets)
This produces:

[(set([1, 2, 3]), set([4, 5, 6, 7]), set([8, 9, 10, 5])), 
 (set([1, 2, 3, 4]), set([4, 5, 6, 7]), set([8, 9, 10, 5]))]
For this example it runs a little faster than your code, though if I expand it to keep track of the subset names it runs a little slower. But this is a small example, so the timings do not mean much. (Edit: as shown in another answer, this approach slows down considerably on larger problems.)
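Regarding tracking names: here is a sketch of such a variant (foo_named and its return format are mine, not from the answer), iterating over the dict B from the question so that the chosen subset names come back instead of bare sets:

import itertools

def foo_named(A, named_subsets):
    found = []
    for n in range(2, len(named_subsets)):
        # combinations over a dict iterates over its keys
        for names in itertools.combinations(named_subsets, n):
            u = set().union(*(named_subsets[k] for k in names))
            if A == u:
                found.append(names)
        if found:
            break
    return found

print(foo_named(A, B))
# e.g. [('b', 'd', 'f'), ('c', 'd', 'f')] (key order may vary)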

numpy is not going to help here, because we are not dealing with arrays or parallel operations. As others have written, this is basically a search problem. You can speed up the inner steps and try to prune away dead ends, but you cannot avoid trying many alternatives.


The usual way of doing searches in numpy is to construct a matrix of all the combinations and then pull out the desired ones with something like sum, min, or max. It is a brute-force approach that takes advantage of the fast compiled operations on arrays.
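To make that concrete, here is a small sketch of that style (the membership matrix M and the helper covers are my own names), using the subsets list defined above: each subset becomes a boolean row over the elements of A, and a combination covers A when the OR of its rows is all True:

import itertools
import numpy as np

elems = sorted(A)
# M[i, j] is True when subsets[i] contains elems[j]
M = np.array([[x in s for x in elems] for s in subsets], dtype=bool)

def covers(rows):
    return M[list(rows)].any(axis=0).all()

# check all size-3 combinations, as in the result above
hits = [c for c in itertools.combinations(range(len(subsets)), 3) if covers(c)]
print(hits)
# indices into subsets: [(1, 3, 5), (2, 3, 5)], i.e. (b, d, f) and (c, d, f)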

Thanks for the question; I found it really interesting. I have tested the code below on Python 2.6, 2.7 and 3.3. You may find it fun to run it yourself; I made it easy to paste into an interpreter or to run as a script.

Another solution here attempts to solve by brute force, i.e. by going through every possible combination, which may be doable for ten elements, as in the asker's example, but offers no solution for the parameters the asker actually requested: choosing a combination of subsets (up to 1.5k elements long, drawn from a superset of 15k elements) out of 30k sets. I found that for those parameters, attempting to find a solution set where n = 40 (very unlikely) would mean searching a number of combinations many orders of magnitude over a googol, which is quite impossible.

Set Up

Here I import some modules used to benchmark my functions and create the data. I also created a timer decorator to wrap the functions so that I can easily measure how much time passes before a function completes (or I give up and interrupt it).
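Here is a minimal sketch of such a timer decorator and of the constants used below (the exact implementation is my reconstruction, consistent with the INFO log lines further down; the constants match the parameters from the question):

import functools
import logging
import random
import time

logging.basicConfig(level=logging.INFO)

PARENT_SIZE = 15000     # unique elements in the parent set
N_SUBSETS = 30000       # number of sub sets
MAX_SUBSET_SIZE = 1500  # longest sub set
random.seed(0)          # the outputs below were produced with seed(0) and seed(1)

def timer(func):
    '''log how many milliseconds the wrapped function took'''
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        logging.info('{0} function took {1:.3f} ms'.format(
            func.__name__, (time.time() - start) * 1000))
        return result
    return wrapper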

Data Creation Functions

Next I have to create the data:

@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets


@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
                                          len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
                                          len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional Pre-processing

I provide some pre-processing; it runs in a few seconds but does not help much, shaving off only fractions of a second, but it is recorded here for the reader's edification:

@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
Greedy Set Cover

Here is the greedy set cover function itself, which repeatedly takes the subset that covers the most still-uncovered elements:

@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results

Main Script

And the script that ties everything together:

# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, s in enumerate(results, 1):  # 's', not 'set', to avoid shadowing the builtin
    logging.debug('len of set {0} is {1}'.format(i, len(s)))

Final Results

This delivers a final result of 46 (ish) subsets, running in Python 2 with the original parameters the asker gave.

Here is the output with seed(0):

INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46

And here is the output with seed(1):

INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47

This was a lot of fun; thank you for the question.

PS: I decided to try benchmarking the naive brute force approach as well:

INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, as it would never get anywhere close within my lifetime, or perhaps our sun's lifetime:

>>> import math
>>> def combinations(n, k):
...     return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
... 
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
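(An aside beyond the original answers: the combinations function above relies on Python 2's integer division; on Python 3, / would try to produce a float and overflow for numbers this large, so use // there, or, on Python 3.8+, math.comb, which computes the same count exactly.)

>>> import math  # Python 3.8+
>>> math.comb(30000, 40)  # same astronomically large count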