Python set cover or hitting set; Numpy, minimal combination of elements to form the complete set

python, algorithm, numpy, set, combinations

My goal is to find the smallest possible number of the sub sets [a-f] that together make up the full set A:

A = set([1,2,3,4,5,6,7,8,9,10]) # full set


#--- below are sub sets of A ---

a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])
In reality, the parent set A I am dealing with contains 15k unique elements, and there are 30k sub sets, ranging in length from a single unique element to 1.5k unique elements.

The code I am using so far looks more or less like the following, and it is painfully slow:

import random


B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = B.keys()
random.shuffle(Bx)

Dict = {}

for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)

    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:
                L = Lx
                z.append(ii)

    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]

print Dict[min(Dict.keys())]
This just gives an idea of the approach I have taken. For clarity I have left out some logic that minimizes iterating over sets that are already too large, and other things like that.


I imagine Numpy would be really good at this type of problem, but I cannot think of a way to use it.

The question asks for an implementation; there is no fast algorithm that finds the optimal solution. However, the greedy solution to the problem (repeatedly picking the subset that contains the most elements not yet covered) does a good job in a reasonable amount of time.
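For illustration, here is a minimal sketch of that greedy rule (my own, not the optimized version below), applied to the example sets from the question; it assumes the candidates can in fact cover the full set:

def greedy_cover(full_set, candidates):
    covered = set()
    chosen = []
    while covered < full_set:
        # pick whichever candidate contributes the most uncovered elements
        best = max(candidates, key=lambda s: len(s - covered))
        chosen.append(best)
        covered |= best
    return chosen

print(greedy_cover(A, [a, b, c, d, e, f]))
# -> the sets c, f and d: three subsets suffice for the example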

Edited to add: @Aaron Hall's greedy set cover routine can be improved by using the drop-in replacement below. In Aaron's code, we compute a score len(s - result_set) for every remaining subset, each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if, in the current iteration, the best set picked so far scores higher than some remaining subset did in an earlier iteration, we know that subset's score cannot have improved, and it cannot beat the current best. This suggests keeping the subsets to process in a priority queue; in Python, we can implement the idea with heapq:

# at top of file
import heapq
#... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) just provides a unique number for each subset,
        # used to tie-break equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the resultset
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
Here are the timings for the code above; it is a little more than 3 times faster:

INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
Note: the two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; it comes down to the "lucky" choice of subset when scores are tied.

The priority queue/heap is a well-known optimization of the greedy algorithm, although I could not find a suitable discussion of it to link to.


While the greedy algorithm is a fast way to get an approximate answer, you can spend time afterwards improving on that answer, knowing that it gives an upper bound on the minimal set cover. Techniques for doing so include simulated annealing and branch-and-bound algorithms. Here is a solution that uses itertools.combinations to iterate over various combinations of the subsets, and union(*x) to combine them:

import itertools
subsets = [a,b,c,d,e,f]
def foo(A, subsets):
    found = []
    for n in range(2,len(subsets)):
        for x in itertools.combinations(subsets, n):
            u =  set().union(*x)
            if A==u:
                found.append(x)
        if found:
            break
    return found
print foo(A,subsets)
This produces:

[(set([1, 2, 3]), set([4, 5, 6, 7]), set([8, 9, 10, 5])), 
 (set([1, 2, 3, 4]), set([4, 5, 6, 7]), set([8, 9, 10, 5]))]
For this example it runs a little faster than your code, though if I expand it to keep track of the subset names it runs a little slower. But this is a small example, so the timings do not mean much. (Edit: as shown in another answer, this approach slows down considerably on larger problems.)
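Regarding tracking names: here is a sketch of such a variant (foo_named and its return format are mine, not from the answer), iterating over the dict B from the question so that the chosen subset names come back instead of bare sets:

import itertools

def foo_named(A, named_subsets):
    found = []
    for n in range(2, len(named_subsets)):
        # combinations over a dict iterates over its keys
        for names in itertools.combinations(named_subsets, n):
            u = set().union(*(named_subsets[k] for k in names))
            if A == u:
                found.append(names)
        if found:
            break
    return found

print(foo_named(A, B))
# e.g. [('b', 'd', 'f'), ('c', 'd', 'f')] (key order may vary)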

numpy is not going to help here, because we are not dealing with arrays or parallel operations. As others have written, this is basically a search problem. You can speed up the inner steps and try to prune away dead ends, but you cannot avoid trying many alternatives.


The usual way of doing searches in numpy is to construct a matrix of all the combinations and then pull out the desired ones with something like sum, min, or max. It is a brute-force approach that takes advantage of the fast compiled operations on arrays.
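To make that concrete, here is a small sketch of that style (the membership matrix M and the helper covers are my own names), using the subsets list defined above: each subset becomes a boolean row over the elements of A, and a combination covers A when the OR of its rows is all True:

import itertools
import numpy as np

elems = sorted(A)
# M[i, j] is True when subsets[i] contains elems[j]
M = np.array([[x in s for x in elems] for s in subsets], dtype=bool)

def covers(rows):
    return M[list(rows)].any(axis=0).all()

# check all size-3 combinations, as in the result above
hits = [c for c in itertools.combinations(range(len(subsets)), 3) if covers(c)]
print(hits)
# indices into subsets: [(1, 3, 5), (2, 3, 5)], i.e. (b, d, f) and (c, d, f)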

Thanks for the question; I found it really interesting. I have tested the code below on Python 2.6, 2.7 and 3.3. You may find it fun to run it yourself; I made it easy to paste into an interpreter or to run as a script.

Another solution here attempts to solve by brute force, i.e. by going through every possible combination, which may be doable for ten elements, as in the asker's example, but offers no solution for the parameters the asker actually requested: choosing a combination of subsets (up to 1.5k elements long, drawn from a superset of 15k elements) out of 30k sets. I found that for those parameters, attempting to find a solution set where n = 40 (very unlikely) would mean searching a number of combinations many orders of magnitude over a googol, which is quite impossible.

Set Up

Here I import some modules used to benchmark my functions and create the data. I also created a timer decorator to wrap the functions so that I can easily measure how much time passes before a function completes (or I give up and interrupt it).
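Here is a minimal sketch of such a timer decorator and of the constants used below (the exact implementation is my reconstruction, consistent with the INFO log lines further down; the constants match the parameters from the question):

import functools
import logging
import random
import time

logging.basicConfig(level=logging.INFO)

PARENT_SIZE = 15000     # unique elements in the parent set
N_SUBSETS = 30000       # number of sub sets
MAX_SUBSET_SIZE = 1500  # longest sub set
random.seed(0)          # the outputs below were produced with seed(0) and seed(1)

def timer(func):
    '''log how many milliseconds the wrapped function took'''
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        logging.info('{0} function took {1:.3f} ms'.format(
            func.__name__, (time.time() - start) * 1000))
        return result
    return wrapper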

Data Creation Functions

Next I have to create the data:

@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets


@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
                                          len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
                                          len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional Pre-processing

I provide some pre-processing; it runs in a few seconds but does not help much, shaving off only fractions of a second, but it is recorded here for the reader's edification:

@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
Greedy Set Cover

Here is the greedy set cover function itself, which repeatedly takes the subset that covers the most still-uncovered elements:

@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results

Main Script

And the script that ties everything together:

# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, s in enumerate(results, 1):  # 's', not 'set', to avoid shadowing the builtin
    logging.debug('len of set {0} is {1}'.format(i, len(s)))

Final Results

This delivers a final result of 46 (ish) subsets, running in Python 2 with the original parameters the asker gave.

Here is the output with seed(0):

INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46

And here is the output with seed(1):

INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47

This was a lot of fun; thank you for the question.

PS: I decided to try benchmarking the naive brute force approach as well:

INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, as it would never get anywhere close within my lifetime, or perhaps our sun's lifetime:

>>> import math
>>> def combinations(n, k):
...     return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
... 
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
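(An aside beyond the original answers: the combinations function above relies on Python 2's integer division; on Python 3, / would try to produce a float and overflow for numbers this large, so use // there, or, on Python 3.8+, math.comb, which computes the same count exactly.)

>>> import math  # Python 3.8+
>>> math.comb(30000, 40)  # same astronomically large count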