Get all combinations of a word using repeat - Python itertools.product is too slow

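I have an array of values and I need to generate every possible combination of them. This is easy to do with itertools.product.

For example, apple could be elppa, appel, lppae, and so on.

However, there are two caveats.

I need every combination of this word's letters with a repeat of 30, for example a 30-character string such as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.

Obviously this is a huge amount of data. When I run the test with a repeat of, say, 6 to 10, it is reasonably fast (under a minute). When I ran it overnight with a repeat of 30, it indicated that the test would take days to complete.

I have tried NumPy, which is commonly suggested on StackOverflow as a faster/lighter approach. I have not had any success with it here: every variation I found ended up killing my machine and eating disk space, instead of being slow (too slow for this test) but more efficient.

I also don't understand how to pull all of this data into a NumPy array and then compute the following without overwhelming the system.

Ultimately

The point of this exercise is to count the result lines in which the word apple appears, but only when it appears exactly once in the line. A line that contains apple once counts; this one does not: appleaaaaaaaaaaaaaaapple

The code below works without putting much strain on the machine, but it runs far too slowly.

Thanks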
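
To put rough numbers on the jump from a repeat of 10 to a repeat of 30, here is a back-of-the-envelope sketch. The throughput figure `strings_per_second` is an assumption chosen for illustration, not a measurement from this post.

```python
# Size of the search space for a 4-letter alphabet ("a", "p", "l", "e").
alphabet_size = 4
length = 30
total = alphabet_size ** length
print(total)  # 1152921504606846976, i.e. 2**60

# Assumed throughput, purely illustrative: ten million strings per second.
strings_per_second = 10_000_000
seconds_per_year = 365 * 24 * 3600
years = total / strings_per_second / seconds_per_year
print(round(years), "years at the assumed rate")
```

Under that assumption the full enumeration would run for thousands of years rather than days, so no constant-factor speedup of the loop is likely to be enough.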

import itertools
import time

apple = ['a', 'p', 'l', 'e']
occurrences = 0
line = 0
arr_len = len(apple)
length = 30
total = arr_len ** length  # number of strings product() will generate

start_time = time.time()

# Python 3: the built-in map replaces itertools.imap
for string in map(''.join, itertools.product(apple, repeat=length)):
    line += 1
    if string.count('apple') == 1:
        occurrences += 1
        if occurrences % 100000 == 0:
            print(occurrences, "--- %s seconds ---" % (time.time() - start_time), total, line)

print('Occurrences:', occurrences)
print('Last line no.', line)
print("--- %s seconds ---" % (time.time() - start_time))

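The way you are trying to solve the problem is inherently exponential. You need to use dynamic programming. This problem has a polynomial-time solution. If your word has n characters, you can use a Markov chain with 2n states.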

import numpy as np

word = 'papal'
length = 10

word_chars = list(set(word))
n = len(word)
m = len(word_chars)
# states[0:n]:  no occurrence yet, index = number of word characters matched
# states[n:2n]: exactly one occurrence so far, same indexing
states = [0] * (2*n)
states[0] = 1
jumps = np.zeros((n, m), dtype=int)  # np.int was removed from NumPy
for i in range(n):
    for j in range(m):
        # We've seen the first i characters of word, and we see character word_chars[j]
        if word[i] == word_chars[j]:
            value = i+1
        else:
            # Fall back to the longest prefix of word that still matches
            for k in range(i+1):
                if word[k: i] + word_chars[j] == word[:i - k + 1]:
                    value = i - k + 1
                    break
            else:
                value = 0
        jumps[i, j] = value

for i in range(length):
    new_states = [0] * (2*n)
    for j in range(n):
        for jump in jumps[j]:
            new_states[jump] += states[j]
            # A second full match would land on index 2*n and is dropped
            if n+jump < 2*n:
                new_states[n+jump] += states[n+j]
    states = new_states

# Built-in sum keeps exact Python integers for large lengths
print(sum(states[n:]))
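
The same idea can be wrapped in a function so that it can be tried on the question's word. This is a minimal sketch, not the answerer's code: the name `count_exactly_once` and the split into two lists `zero` and `one` are assumptions made for readability. Like the script above, it restarts from zero matched characters after a full match, so overlapping occurrences are not tracked separately.

```python
def count_exactly_once(word, length):
    """Count strings of `length` over the word's letters that contain `word` once."""
    alphabet = sorted(set(word))
    n = len(word)

    def next_state(matched, ch):
        # Longest prefix of `word` that is a suffix of word[:matched] + ch.
        seen = word[:matched] + ch
        for size in range(len(seen), 0, -1):
            if seen.endswith(word[:size]):
                return size
        return 0

    # jumps[j] holds, for every letter, the matched-prefix length after reading it.
    jumps = [[next_state(j, ch) for ch in alphabet] for j in range(n)]

    zero = [1] + [0] * (n - 1)  # no occurrence yet, indexed by matched prefix
    one = [0] * n               # exactly one occurrence so far
    for _ in range(length):
        new_zero, new_one = [0] * n, [0] * n
        for j in range(n):
            for jump in jumps[j]:
                if jump == n:
                    # Word completed: a first occurrence moves to `one`,
                    # a second occurrence is discarded.
                    new_one[0] += zero[j]
                else:
                    new_zero[jump] += zero[j]
                    new_one[jump] += one[j]
        zero, one = new_zero, new_one
    return sum(one)


print(count_exactly_once("apple", 10))  # 6142
print(count_exactly_once("apple", 30))  # 28768047528652794
```

The work grows with `length * len(word) * len(alphabet)`, so a repeat of 30 finishes immediately instead of enumerating 4**30 strings.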

If the word is 'papa', does 'papapa' match? If not, you should remove states from the Markov chain.
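
The distinction matters because Python's `str.count` only counts non-overlapping occurrences. A small sketch of the difference follows; `count_overlapping` is a hypothetical helper written for this illustration.

```python
def count_overlapping(text, word):
    """Count occurrences of `word` in `text`, allowing them to overlap."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found too.
        start = text.find(word, start + 1)
    return count


print("papapa".count("papa"))                # 1 (non-overlapping)
print(count_overlapping("papapa", "papa"))   # 2 (positions 0 and 2)
```

For apple the two counts always agree, because no proper suffix of the word is also a prefix of it.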

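With a little thought, we can apply some basic counting techniques from probability to compute the number of sequences in which a word occurs exactly once. However, the dynamic programming solution is probably easier to come up with and may run faster for small sequence sizes. The solution below has linear time complexity in the sequence length but is not optimized for speed; I am posting it here just for reference: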

from math import comb  # scipy.misc.comb no longer exists; math.comb is exact

def k(i, t, w):
    """Count the length-i sequences that contain the word exactly once.

    Arguments
    ---------
    i : int
        Length of the sequence (the `repeat` passed to product).
    t : int
        The number of characters in the alphabet.
    w : int
        Length of the word.

    """
    def n(x):
        # Reconstructed helper: the post calls `n` without defining it. This
        # definition (placements of the word in x slots, times free fillings)
        # is an assumption that reproduces the posted output.
        return 0 if x < w else (x - w + 1) * t**(x - w)

    # consider all i - w + 1 ways of placing the word into `i` slots,
    # and subtract the number of sequences with multiple occurrences (n_x, n_y)
    r = i - w
    tot = 0
    for x in range(r + 1):
        y = r - x
        n_y = n(y)
        n_x = n(x)
        s = t**x * n_y + t**y * n_x
        tot += t**r - s

    # for i >= 3 * w (15 for a five-letter word) we must compute a correction,
    # because we are "double counting" some sequences with multiple
    # occurrences. The correction turns out to be an alternating sequence of
    # binomial coefficients
    cor = 0
    for c_k in range(2, i // w):
        c_n = (c_k + 1) + i - w * (c_k + 1)
        cor += (-1)**c_k * comb(c_n, c_k) * n(i - w * c_k)

    return tot + cor
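
As an independent cross-check, the same count can be written as a single inclusion-exclusion sum. This is a swapped-in formulation rather than the answerer's method, and it assumes the word cannot overlap itself (true for apple, not for a word like papa). The name `exactly_once` is hypothetical.

```python
from math import comb


def exactly_once(i, t, w):
    """Sequences of length i over t letters with exactly one occurrence
    of a non-self-overlapping word of length w."""
    total = 0
    j = 1
    while j * w <= i:
        # Ways to place j non-overlapping copies of the word in i slots.
        placements = comb(i - j * w + j, j)
        # Remaining i - j*w slots are free; weight (-1)**(j-1) * j keeps
        # only the sequences whose occurrence count is exactly one.
        total += (-1) ** (j - 1) * j * placements * t ** (i - j * w)
        j += 1
    return total


print(exactly_once(10, 4, 5))  # 6142
print(exactly_once(30, 4, 5))  # 28768047528652794
```

The weight works because a sequence with m occurrences is counted comb(m, j) times in the j-th term, and the alternating sum of j * comb(m, j) is 1 when m is 1 and 0 otherwise.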

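Is the loop never going to finish?

Replacing `string.count('apple') == 1` with `'apple' in string` would speed up your solution considerably. Maybe this can help you.

@neig Oh, I meant that his approach is not feasible. I believe this solution can be worked out with pencil and paper.

My solution finds 28768047528652794 matches for (apple, 30).

To satisfy the query, a case such as ('papa' > 'papapa') would count as 1 match, even though there are two occurrences.

We got the same answer, that's great!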

>>> for i in range(31): print(i, k(i, 4, 5))

0 0
1 0
2 0
3 0
4 0
5 1
6 8
7 48
8 256
9 1280
10 6142
11 28648
12 130880
13 588544
14 2613760
15 11491331
16 50102320
17 216924640
18 933629696
19 3997722880
20 17041629180
21 72361164720
22 306190089280
23 1291609627904
24 5433306572800
25 22798585569285
26 95447339991160
27 398767643035280
28 1662849072252416
29 6921972555609600
30 28768047528652794
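
The first rows of this table are small enough to confirm by brute force. The sketch below is only practical for short lengths; `brute_force` is a hypothetical helper that enumerates every string the way the question's loop does.

```python
import itertools


def brute_force(letters, word, length):
    """Enumerate every string and count those containing `word` exactly once."""
    return sum(
        1
        for chars in itertools.product(letters, repeat=length)
        if "".join(chars).count(word) == 1
    )


letters = "aple"
word = "apple"
w, t = len(word), len(letters)
for i in range(5, 10):
    # Below length 2*w only one copy of the word fits, so the count is
    # (number of positions) * (free fillings of the remaining slots).
    closed_form = (i - w + 1) * t ** (i - w)
    print(i, brute_force(letters, word, i), closed_form)
```

At i = 10 the closed form gives 6 * 4**5 = 6144, but the string appleapple is produced by two placements and contains the word twice, so both are removed: 6144 - 2 = 6142, which matches the table.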