Python: find all possible word combinations from a sequence of characters (word segmentation)


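I am running some word-segmentation experiments like the one below. lst is a sequence of characters, and output is the list of all possible words.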

lst = ['a', 'b', 'c', 'd']

def foo(lst):
    ...
    return output

output = [['a', 'b', 'c', 'd'],
          ['ab', 'c', 'd'],
          ['a', 'bc', 'd'],
          ['a', 'b', 'cd'],
          ['ab', 'cd'],
          ['abc', 'd'],
          ['a', 'bcd'],
          ['abcd']]

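I looked at combinations and permutations in the itertools library and tried them as well. However, it seems I was looking in the wrong place, because this is not purely a permutation or combination problem.

It seems I could achieve this with lots of loops, but that would probably be inefficient.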

EDIT

Word order matters, so combinations like ['ba', 'dc'] or ['cd', 'ab'] are not valid. The order should always be left to right.

EDIT

@Stuart's solution doesn't work in Python 2.7.6.

EDIT

@Stuart's solution does work in Python 2.7.6; see the comments below.

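There are 8 options, each corresponding to one of the binary numbers 0 through 7:

Each 0 or 1 says whether the 2 letters at that index are "glued" together: 0 means no, 1 means yes.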

>>> lst = ['a', 'b', 'c', 'd']
... output = []
... formatstr = "{{:0{}.0f}}".format(len(lst)-1)
... for i in range(2**(len(lst)-1)):
...     output.append([])
...     s = "{:b}".format(i)
...     s = str(formatstr.format(float(s)))
...     lstcopy = lst[:]
...     for j, c in enumerate(s):
...         if c == "1":
...             lstcopy[j+1] = lstcopy[j] + lstcopy[j+1]
...         else:
...             output[-1].append(lstcopy[j])
...     output[-1].append(lstcopy[-1])
... output
[['a', 'b', 'c', 'd'],
 ['a', 'b', 'cd'],
 ['a', 'bc', 'd'],
 ['a', 'bcd'],
 ['ab', 'c', 'd'],
 ['ab', 'cd'],
 ['abc', 'd'],
 ['abcd']]
>>>
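The same "one bit per gap" idea can be sketched with integer bit tests instead of string formatting. This is only an illustration, not the answer's own code: the function name segmentations is hypothetical, and it assumes each element of the input is a string so that neighbours can be concatenated.

```python
def segmentations(chars):
    """Yield every left-to-right grouping of chars into words.

    Hypothetical helper for illustration; assumes chars is a
    sequence of strings.
    """
    n = len(chars)
    if n == 0:
        return
    # One bit per gap between neighbouring characters: n - 1 bits in total.
    for mask in range(2 ** (n - 1)):
        words = [chars[0]]
        for j in range(1, n):
            # Bit for the gap between chars[j-1] and chars[j],
            # reading the mask from its most significant bit.
            glued = (mask >> (n - 1 - j)) & 1
            if glued:
                words[-1] += chars[j]    # 1: glue onto the current word
            else:
                words.append(chars[j])   # 0: start a new word
        yield words


result = list(segmentations(['a', 'b', 'c', 'd']))
print(result)
```

Reading the mask from the most significant bit keeps the results in the same order as the listing above, from ['a', 'b', 'c', 'd'] to ['abcd'].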

Output:

zsh 2419 % ./words.py  
['abcd']
['a', 'bcd']
['ab', 'cd']
['abc', 'd']
['a', 'b', 'cd']
['a', 'bc', 'd']
['ab', 'c', 'd']
['a', 'b', 'c', 'd']

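itertools.product should indeed be able to help you.

The idea is this: consider A1, A2, ..., AN separated by slabs. There are N-1 slabs. Where a slab is present, there is a segmentation; where it is absent, there is a join. So for a sequence of length N there are 2^(N-1) such combinations.

As in the code below: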

import itertools
lst = ['a', 'b', 'c', 'd']
combinatorics = itertools.product([True, False], repeat=len(lst) - 1)

solution = []
for combination in combinatorics:
    i = 0
    one_such_combination = [lst[i]]
    for slab in combination:
        i += 1
        if not slab: # there is a join
            one_such_combination[-1] += lst[i]
        else:
            one_such_combination += [lst[i]]
    solution.append(one_such_combination)

print(solution)
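As a rough check of the 2^(N-1) count, the loop above can be wrapped in a function and run for a few lengths. This is a sketch under assumptions: segment_with_product is a hypothetical name, and the input is assumed to be a non-empty sequence of strings (itertools.product rejects a negative repeat).

```python
import itertools


def segment_with_product(chars):
    """Hypothetical wrapper around the slab idea: True means 'split here'."""
    results = []
    for cuts in itertools.product([True, False], repeat=len(chars) - 1):
        words = [chars[0]]
        for ch, cut in zip(chars[1:], cuts):
            if cut:
                words.append(ch)     # slab present: start a new word
            else:
                words[-1] += ch      # no slab: join onto the current word
        results.append(words)
    return results


# The number of segmentations should be 2^(N-1) for each length N.
for n in range(1, 7):
    assert len(segment_with_product(list('abcdef'[:n]))) == 2 ** (n - 1)

segs = segment_with_product(['a', 'b', 'c', 'd'])
print(segs[0], segs[-1])  # ['a', 'b', 'c', 'd'] ['abcd']
```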

You can use a recursive generator:

def split_combinations(L):
    for split in range(1, len(L)):
        for combination in split_combinations(L[split:]):
            yield [L[:split]] + combination
    yield [L]

print (list(split_combinations('abcd')))

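Edit. I'm not sure how well this scales to long strings, or at what point it hits Python's recursion limit. As in some of the other answers, you could also use combinations from itertools to work through every possible combination of split points: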

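from itertools import combinations

def split_string(s, t):
    # t is a tuple of split points; pair each boundary with the next one
    return [s[start:finish] for start, finish in zip((None, ) + t, t + (None, ))]

def split_combinations(s):
    # i is the number of split points to place
    for i in range(len(s)):
        for split_points in combinations(range(1, len(s)), i):
            yield split_string(s, split_points)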

Both of these appear to work as intended in Python 2.7 and Python 3.2. As @twasbrillig says, make sure you indent it as shown.
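One way to gain some confidence that the two approaches above agree is to compare their results on a slightly longer string. The sketch below restates both under hypothetical names (split_recursive, split_by_points) so that it runs on its own. On the recursion question: the recursive version nests at most len(s) generator calls, while the number of results doubles with every added character, so the size of the output is likely to become the practical limit well before the recursion depth does.

```python
from itertools import combinations


def split_recursive(s):
    """Recursive generator; nesting depth is at most len(s)."""
    for split in range(1, len(s)):
        for rest in split_recursive(s[split:]):
            yield [s[:split]] + rest
    yield [s]


def split_by_points(s):
    """Choose k split points out of the len(s) - 1 gaps, for every k."""
    for k in range(len(s)):
        for points in combinations(range(1, len(s)), k):
            bounds = (0,) + points + (len(s),)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]


word = 'abcdefgh'
rec = sorted(split_recursive(word))
pts = sorted(split_by_points(word))
assert rec == pts                        # same segmentations, order aside
assert len(rec) == 2 ** (len(word) - 1)
print(len(rec))  # 128
```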

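Your code doesn't work for lst = ['a', 'b', 'c', 'd', 'e'].

Thanks! Fixed it so it works with more letters as well. Even if I don't get the accept, this was a fun exercise :)

^ haha! I upvoted you. We both have the same solution. You were quicker! :-)

Thanks. Yes, we saw the problem the same way too. Yours makes good use of the itertools library, which I'll read up on.

I don't see ['abc', 'd'] in my output: In [1]: def split_combinations(L): ...: for split in range(1, len(L)): ...: for combination in split_combinations(L[split:]): ...: yield [L[:split]] + combination ...: yield [L] ...: In [2]: print(list(split_combinations('abcd'))) [['a', 'b', 'cd'], ['a', 'bcd'], ['a', 'bcd'], ['abcd'], ['ab', 'cd'], ['abcd'], ['abcd'], ['abcd']]

>>> def split_combinations(L): ... for split in range(1, len(L)): ... for combination in split_combinations(L[split:]): ... yield [L[:split]] + combination ... yield [L] ... print(list(split_combinations('abcd'))) [['a', 'b', 'c', 'd'], ['a', 'b', 'cd'], ['a', 'bc', 'd'], ['a', 'bcd'], ['ab', 'c', 'd'], ['ab', 'cd'], ['abc', 'd'], ['abcd']] Python 3.4.0 (default, Apr 11 2014, 13:05:11)

Figured out the difference. If yield [L] is indented twice instead of once, you get the incorrect result. Make sure it is indented only once and you should get the right answer.

I can't upvote you. I wrote the reason in the comment below your answer. Your code is shorter than mine, which is nice, but on my machine it doesn't work properly :(

It's actually Stuart's answer, not mine, but Sriram upvoted me, so it's all good.

See my code in Python 2.7.3 and Python 3.2.3.

I chose this answer because it takes the least time, but the other solutions are great too!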
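The indentation issue discussed in the comments can be reproduced with a small side-by-side sketch. The name split_combinations_misindented is hypothetical and exists only to show the mistake; the second function is the recursive generator from the answer with the final yield at the intended level.

```python
def split_combinations_misindented(L):
    # Hypothetical variant for illustration: the final yield sits
    # inside the outer loop, so it runs once per split point and
    # never runs for a single-character string.
    for split in range(1, len(L)):
        for combination in split_combinations_misindented(L[split:]):
            yield [L[:split]] + combination
        yield [L]


def split_combinations(L):
    # Intended version: the final yield runs exactly once per call.
    for split in range(1, len(L)):
        for combination in split_combinations(L[split:]):
            yield [L[:split]] + combination
    yield [L]


bad = list(split_combinations_misindented('abcd'))
good = list(split_combinations('abcd'))
print(['abc', 'd'] in bad, ['abc', 'd'] in good)  # False True
print(len(good))  # 8
```

With the extra indentation, some splits are missing and others repeat, which matches the symptom reported above.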
