Python: Finding subsets across a many-to-many mapping


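I am trying to use a many-to-many mapping to find subsets of one set that map onto a specific subset of another set.

I have many genes. Each gene is a member of one or more COGs (and vice versa), e.g.

gene1 is a member of COG1
gene1 is a member of COG1003
gene2 is a member of COG2
gene3 is a member of COG273
gene4 is a member of COG1
gene5 is a member of COG273
gene5 is a member of COG71
gene6 is a member of COG1
gene6 is a member of COG273

I have a short set of COGs that represents an enzyme, e.g. COG1, COG273.

I want to find all sets of genes that between them are members of every COG in the enzyme, but without unnecessary overlap (in this case, for example, "gene1 and gene6" would be spurious, because gene6 is already a member of both COGs).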

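In this example, the answers would be:

gene1 and gene3
gene1 and gene5
gene3 and gene4
gene4 and gene5
gene6

While I could get all members of each COG and create a "product", this would contain spurious results (as mentioned above), where the set holds more genes than necessary.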
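The "product" idea can still work if it is followed by a filter that drops any gene set containing a redundant gene. The sketch below is one possible way to do that, not code from the thread; the helper names `covers` and `minimal_gene_sets` are made up for illustration. It relies on the observation that in a gene set with no redundant gene, every gene is the only one covering some required COG, so each such set does appear among the product's choices.

```python
from itertools import product

gene_cogs = {
    "gene1": ["COG1", "COG1003"], "gene2": ["COG2"], "gene3": ["COG273"],
    "gene4": ["COG1"], "gene5": ["COG273", "COG71"], "gene6": ["COG1", "COG273"],
}
target = {"COG1", "COG273"}  # the COGs that make up the enzyme


def covers(genes, target, gene_cogs):
    """Return True if the genes are, between them, members of every target COG."""
    covered = set()
    for gene in genes:
        covered.update(gene_cogs[gene])
    return target <= covered


def minimal_gene_sets(target, gene_cogs):
    """Pick one member gene per target COG, then keep only sets with no redundant gene."""
    members = [
        [gene for gene, cogs in gene_cogs.items() if cog in cogs]
        for cog in sorted(target)
    ]
    # frozenset collapses duplicates such as (gene6, gene6) and reordered picks.
    candidates = {frozenset(choice) for choice in product(*members)}
    # A set is kept only if removing any single gene breaks the coverage.
    return [
        s for s in candidates
        if not any(covers(s - {gene}, target, gene_cogs) for gene in s)
    ]


for genes in sorted(sorted(s) for s in minimal_gene_sets(target, gene_cogs)):
    print(" and ".join(genes))
# gene1 and gene3
# gene1 and gene5
# gene3 and gene4
# gene4 and gene5
# gene6
```

The product grows with the number of member genes per COG, so this is only a reasonable fit when the enzyme's COG set is short, as described here.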

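My mappings are currently held in a dictionary where the key is the gene ID and the value is a list of the COG IDs that the gene is a member of. However, I accept that this might not be the best way to store the mapping.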
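Since the relation is many-to-many, one option is to keep the gene-to-COG dictionary and derive the reverse COG-to-gene index from it, storing sets rather than lists so membership tests and set algebra are cheap. This is a minimal sketch under that assumption; `invert` is a hypothetical helper name.

```python
from collections import defaultdict

# Gene -> COGs, using sets instead of lists.
gene_cogs = {
    "gene1": {"COG1", "COG1003"},
    "gene2": {"COG2"},
    "gene3": {"COG273"},
    "gene4": {"COG1"},
    "gene5": {"COG273", "COG71"},
    "gene6": {"COG1", "COG273"},
}


def invert(mapping):
    """Build the reverse index of a dict that maps keys to sets of values."""
    inverse = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverse[value].add(key)
    return dict(inverse)


# COG -> genes, derived rather than maintained by hand.
cog_genes = invert(gene_cogs)
print(sorted(cog_genes["COG1"]))    # ['gene1', 'gene4', 'gene6']
print(sorted(cog_genes["COG273"]))  # ['gene3', 'gene5', 'gene6']
```

Deriving one direction from the other avoids the two dictionaries drifting out of sync.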

A basic attack:

Keep your representation as it is for now.
Initialize a dictionary with the COGs as keys; each value is an initial count of 0.

Now start building your list of enzyme coverage sets (ecs_list), one ecs at a time.  Do this by starting at the front of the gene list and working your way to the end, considering all combinations.

Write a recursive routine to solve the remaining COGs in the enzyme.  Something like this:

def pick_a_gene(gene_list, cog_list, solution_set, cog_count_dict):
   pick the first gene in the list that is in at least one cog in the list.
   let the rest of the list be remaining_gene_list.
   add the gene to the solution set.
   for each of the gene's cogs:
      increment the cog's count in cog_count_dict
      remove the cog from cog_list (if it's still there).

   is there anything left in the cog_list?
   yes:
      pick_a_gene(remaining_gene_list, cog_list, solution_set, cog_count_dict)
   no:    # we have a solution: check it for minimality
      from every non-zero entry in cog_count_dict, subtract 1.  This gives us a list of excess coverage.
      while the excess list is not empty:
         pick the next gene in the solution set, starting from the *end* (if none, break the loop)
         if the gene's cogs are all covered by the excess:
            remove the gene from the solution set.
            decrement the excess count of each of its cogs.

      The remaining set of genes is an ECS; add it to ecs_list

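Does this work for you? I believe it covers the minimal sets correctly, given that your example is well-behaved. Note that starting from the high end when we check for minimality guards against a case such as this: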

gene1: cog1, cog5
gene2: cog2, cog5
gene3: cog3
gene4: cog1, cog2, cog4
enzyme: cog1 - cog5

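We can see that we need gene3, gene4, and either gene1 or gene2. If we eliminate from the low end, we throw out gene1 and never find that solution. If we start from the high end, we eliminate gene2, but find that solution in a later pass of the main loop.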
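The effect of the elimination order can be checked with a small sketch of the minimality step from the outline above. This is an illustrative interpretation, not the answerer's code: the function name `prune` and its `from_end` flag are assumptions, and only COGs that belong to the enzyme are counted.

```python
from collections import Counter

gene_cogs = {
    "gene1": ["cog1", "cog5"],
    "gene2": ["cog2", "cog5"],
    "gene3": ["cog3"],
    "gene4": ["cog1", "cog2", "cog4"],
}
target = {"cog1", "cog2", "cog3", "cog4", "cog5"}


def prune(solution, gene_cogs, target, from_end=True):
    """Drop genes whose target COGs are all covered by other genes in the solution."""
    # How many genes in the solution cover each target COG.
    counts = Counter(c for g in solution for c in gene_cogs[g] if c in target)
    kept = list(solution)
    order = reversed(solution) if from_end else solution
    for gene in order:
        needed = [c for c in gene_cogs[gene] if c in target]
        # The gene is redundant if every COG it covers has excess coverage.
        if all(counts[c] > 1 for c in needed):
            kept.remove(gene)
            for c in needed:
                counts[c] -= 1
    return kept


full = ["gene1", "gene2", "gene3", "gene4"]
print(prune(full, gene_cogs, target, from_end=True))   # ['gene1', 'gene3', 'gene4']
print(prune(full, gene_cogs, target, from_end=False))  # ['gene2', 'gene3', 'gene4']
```

Both results are valid minimal sets; the order only decides which one a given starting set collapses to.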

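It is possible to construct a case in which there is a three-way conflict of this kind. In that case, we would have to write an extra loop in the minimality check to find them. However, I gather that your data isn't that nasty to us.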
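For testing whether a finished candidate is minimal, an order-independent check is available: a covering set has no redundant gene exactly when every gene in it is the sole member of at least one required COG. The sketch below illustrates that check; `is_minimal_cover` is a hypothetical name, and the geneA/geneB/geneC data is an invented three-way example, not data from the question.

```python
from collections import Counter


def is_minimal_cover(genes, gene_cogs, target):
    """True if genes cover every target COG and no single gene can be dropped."""
    target = set(target)
    # How many of the given genes belong to each target COG.
    counts = Counter(c for g in genes for c in set(gene_cogs[g]) & target)
    if set(counts) != target:
        return False  # some target COG has no member gene
    # Each gene must be the only member of at least one target COG.
    return all(
        any(counts[c] == 1 for c in set(gene_cogs[g]) & target)
        for g in genes
    )


# The four-gene example from above.
gene_cogs = {
    "gene1": ["cog1", "cog5"],
    "gene2": ["cog2", "cog5"],
    "gene3": ["cog3"],
    "gene4": ["cog1", "cog2", "cog4"],
}
target = ["cog1", "cog2", "cog3", "cog4", "cog5"]
print(is_minimal_cover(["gene1", "gene3", "gene4"], gene_cogs, target))           # True
print(is_minimal_cover(["gene1", "gene2", "gene3", "gene4"], gene_cogs, target))  # False

# Hypothetical three-way conflict: any two genes suffice, all three do not.
three_way = {"geneA": ["x", "y"], "geneB": ["y", "z"], "geneC": ["x", "z"]}
print(is_minimal_cover(["geneA", "geneB", "geneC"], three_way, ["x", "y", "z"]))  # False
print(is_minimal_cover(["geneA", "geneB"], three_way, ["x", "y", "z"]))           # True
```

This only validates a candidate; it does not by itself enumerate the alternatives that a three-way conflict hides.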

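Does this work for you? Note that because you said you have a short set of COGs, I went ahead and used nested for loops; there may be ways to optimize this.

For future reference, please include any code you have in your question.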

import itertools

d = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}

COGs = [{'COG1', 'COG273'}] # example list of COGs containing only one enzyme; NOTE: your data should be a list of multiple sets

# create all pair-wise combinations of our data
gene_pairs = list(itertools.combinations(d.keys(), 2))

found = set()
for pair in gene_pairs:

    join = set(d[pair[0]] + d[pair[1]]) # set of COGs for gene pairs

    for COG in COGs:

        # check if a single gene's COGs already match the enzyme exactly
        if sorted(d[pair[0]]) == sorted(COG):
            found.add(pair[0])
        elif sorted(d[pair[1]]) == sorted(COG):
            found.add(pair[1])

        # check if the pair covers the enzyme and neither gene covers it alone
        if COG <= join and pair[0] not in found and pair[1] not in found:
            found.add(pair)

for l in found:
    if isinstance(l, tuple): # a pair of genes
        print(l[0], l[1])
    else: # a single gene that covers the enzyme by itself
        print(l)

Output:

lt = [('gene1', 'COG1'), ('gene1', 'COG1003'), ('gene2', 'COG2'), ('gene3', 'COG273'), ('gene4', 'COG1'),
      ('gene5', 'COG273'), ('gene5', 'COG71'), ('gene6', 'COG1'), ('gene6', 'COG273')]
findGenes('COG1', 'COG273', lt)
('gene1', 'gene3')
('gene1', 'gene5')
('gene4', 'gene3')
('gene4', 'gene5')
['gene6']

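Thanks for the suggestions; they inspired me to hack something together using recursion. I wanted to handle arbitrary gene-COG relationships, so it needed to be a general solution. This should yield all sets of genes (enzymes) that between them are members of all the required COGs, with no duplicate enzymes and no redundant genes:
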
from copy import deepcopy

def get_enzyme_cogs(enzyme, target_enzyme_cogs, gene_cog_dict):
    """Get all target COGs of which there is at least one member gene in the enzyme."""
    cog_list = []
    for gene in enzyme:
        cog_list.extend(gene_cog_dict[gene])
    # Only COGs that belong to the target enzyme are of interest
    return set(cog_list) & target_enzyme_cogs

def get_gene_by_gene_cogs(enzyme, target_enzyme_cogs, gene_cog_dict):
    """Get target COG memberships for each gene in enzyme."""
    cogs_list = []
    for gene in enzyme:
        cogs_list.append(set(gene_cog_dict[gene]) & target_enzyme_cogs)
    return cogs_list

def add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme = None, fulfilled_cogs = None):
    """Generator for all enzymes with membership of all target_enzyme_cogs, without duplicate enzymes or redundant genes."""

    ## Accept a list or a set of target COGs; set operations are used below
    target_enzyme_cogs = set(target_enzyme_cogs)
    base_enzyme_genes = proposed_enzyme or []
    fulfilled_cogs = get_enzyme_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict)

    ## Which COG will we try to find a member of?
    next_cog_to_fill = sorted(list(target_enzyme_cogs-fulfilled_cogs))[0]
    gene_members_of_cog = cog_gene_dict[next_cog_to_fill]

    for gene in gene_members_of_cog:

        ## Check whether any already-present gene's COG set is a subset of the proposed gene's COG set, if so skip addition
        subset_found = False
        proposed_gene_cogs = set(gene_cog_dict[gene]) & target_enzyme_cogs
        for gene_cogs_set in get_gene_by_gene_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict):
            if gene_cogs_set.issubset(proposed_gene_cogs):
                subset_found = True
                break
        if subset_found:
            continue

        ## Add gene to proposed enzyme
        proposed_enzyme = deepcopy(base_enzyme_genes)
        proposed_enzyme.append(gene)

        ## Determine which COG memberships are fulfilled by the genes in the proposed enzyme
        fulfilled_cogs = get_enzyme_cogs(proposed_enzyme, target_enzyme_cogs, gene_cog_dict)

        if (fulfilled_cogs & target_enzyme_cogs) == target_enzyme_cogs:
            ## Proposed enzyme has members of every required COG, so yield
            enzyme = deepcopy(proposed_enzyme)
            proposed_enzyme.remove(gene)
            yield enzyme
        else:
            ## Proposed enzyme is still missing some COG members
            for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme, fulfilled_cogs):
                yield enzyme

Input:
gene_cog_dict = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}
cog_gene_dict = {'COG2': ['gene2'], 'COG1': ['gene1', 'gene4', 'gene6'], 'COG71': ['gene5'], 'COG273': ['gene3', 'gene5', 'gene6'], 'COG1003': ['gene1']}

target_enzyme_cogs = ['COG1','COG273']

Usage:
for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict):
    print(enzyme)

Output:
['gene1', 'gene3']
['gene1', 'gene5']
['gene4', 'gene3']
['gene4', 'gene5']
['gene6']

I don't know how well it performs, though.