Python performance of a permutation operation

python performance loops

I have a data frame containing 2000 words. I am trying to create all permutations of the words, taken 4 at a time:

   import hashlib
   import itertools as it

   perms = it.permutations(words.col, 4)
   done = object()
   nextperm = next(perms, done)
   while nextperm is not done:
       # perform operations...
       permutedString = ' '.join(nextperm)
       if (permutedString.count('a') == 2 and permutedString.count('b') == 3 and
               permutedString.count('c') == 1):
           # calculate the MD5 hash of the candidate string
           hash = hashlib.md5(permutedString.encode('utf-8')).hexdigest()
       nextperm = next(perms, done)

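The script above takes far too long for me. It has already been running for several hours. Is there any way to improve its performance?

Any help with this would be greatly appreciated.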

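Aside from the fact that P(2000, 4) is a huge number, you don't need to check your loop manually with a sentinel; you can iterate over the perms object directly. That is about as much optimization as can be expected from the code you have provided. Whatever happens inside the loop will be the determining factor.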

import itertools as it

perms = it.permutations(words.col, 4)
for perm in perms:
    pass  # do stuff with perm
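
To get a feel for the scale involved, the count can be computed directly. This is only a rough sketch: it assumes Python 3.8 or later (for `math.perm`), and the throughput of one million candidates per second is a made-up figure chosen for illustration, not a measurement.

```python
import math

n_words = 2000   # size of the word list in the question
k = 4            # words per candidate phrase

# Number of ordered selections of k distinct words: 2000 * 1999 * 1998 * 1997
total = math.perm(n_words, k)
print(f"{total:,}")   # 15,952,043,988,000

# Hypothetical rate, assumed only to make the scale tangible.
assumed_rate_per_second = 1_000_000
days = total / assumed_rate_per_second / 86_400
print(f"about {days:.0f} days at the assumed rate")
```

Even at that optimistic rate the full enumeration would take roughly half a year, which is why tidying up the loop alone cannot help.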

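As tzaman pointed out, what happens inside the loop is the key to finding a better way to solve the problem. The basic idea is that you want to minimize the number of pairwise (or N-wise) operations you need to perform.

In your case, you are apparently trying to select phrases with specific letter counts, presumably to crack some kind of "correct horse battery staple" password scheme by exploiting a restriction on the letter counts. In that case, since you are only allowed one letter "c", there is no point in processing any permutation that contains two "c"s, and so on.

We can also do better than just excluding useless words: we don't actually need to compare all the words to see whether they match, we can simply compare their letter-count sets. That is, we can group all the words by their counts of the letters a, b, and c, which can be done in linear time, and then iterate over four of these count sets at a time to see whether they sum to the right values. Now the counting logic only has to run on elements drawn from a collection of ~10 items rather than ~2000. (In fact we could do better still, because we could build the admissible count sets directly, either recursively or with a partitioning technique, but let's start simple.)

Now, you have said that "only one or two strings match this condition". I will take you at your word and limit the amount of optimization I do to what is needed for cases like that.

If only a few phrases satisfy the constraint, then there can also be only a few letter-count groups that satisfy it, and those groups cannot contain many words. So something like this should work: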
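
One way to act on that observation is to discard, before generating anything, every word that on its own already exceeds the budget for some letter. This is a minimal sketch: the `limits` values are the counts tested in the question's loop, while `within_budget` and the sample word list are hypothetical names and data introduced here for illustration.

```python
def within_budget(word, limits):
    """Return True if `word` alone does not exceed any per-letter limit."""
    return all(word.count(letter) <= limit for letter, limit in limits.items())

# Letter counts required of the whole phrase, as tested in the question.
limits = {"a": 2, "b": 3, "c": 1}

# Hypothetical sample; in practice this would be the 2000-word list.
sample_words = ["cab", "accent", "bubble", "banana", "tree"]

usable = [w for w in sample_words if within_budget(w, limits)]
print(usable)   # ['cab', 'bubble', 'tree']
```

Because the number of permutations grows with the fourth power of the list size, every word removed this way shrinks the search space noticeably.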
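
The recursive construction mentioned in the parenthesis could look roughly like the sketch below. It is not the code from this answer: `signature`, `group_by_signature`, `matching_signature_sets`, and the tiny word list are all assumptions made for illustration. The search picks signatures in non-decreasing order, so each multiset is produced once, and it abandons a branch as soon as a signature would overshoot the remaining letter budget.

```python
from collections import defaultdict

def signature(word, letters):
    """Counts of the tracked letters in `word`, as a tuple."""
    return tuple(word.count(letter) for letter in letters)

def group_by_signature(words, letters):
    """Group words by signature in a single pass over the list."""
    groups = defaultdict(list)
    for word in words:
        groups[signature(word, letters)].append(word)
    return groups

def matching_signature_sets(signatures, target, k):
    """Yield each multiset of k signatures whose component-wise sum is target."""
    ordered = sorted(signatures)

    def search(start, remaining, slots, chosen):
        if slots == 0:
            if all(r == 0 for r in remaining):
                yield tuple(chosen)
            return
        for i in range(start, len(ordered)):
            sig = ordered[i]
            # Prune: this signature alone would exceed what is left.
            if any(s > r for s, r in zip(sig, remaining)):
                continue
            left = tuple(r - s for r, s in zip(remaining, sig))
            yield from search(i, left, slots - 1, chosen + [sig])

    yield from search(0, tuple(target), k, [])

# Hypothetical example: track a, b, c and require 2, 3, 1 of them in total.
letters = ("a", "b", "c")
target = (2, 3, 1)
sample_words = ["ab", "bb", "ca", "xyz", "b", "a"]

groups = group_by_signature(sample_words, letters)
found = list(matching_signature_sets(groups, target, 4))
for combo in found:
    print(combo)
```

Each signature set found this way still has to be expanded into actual words and their orderings, but that expansion only happens for the sets that can sum to the target.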

from collections import Counter
from itertools import permutations, product, combinations_with_replacement
import hashlib

# make a fake set of words
with open('/usr/share/dict/words') as fp:
    words = [word.lower() for word in fp.read().split()]
words = set(words[::len(words)//2000][:2000])

# set the target to something which has <2000 matching 4-words
target_counts = Counter({"a": 5, "b": 4, "d": 8})

# collect the words by counts
by_count = {}
for word in words:
    lcount = {letter: word.count(letter) for letter in target_counts}
    by_count.setdefault(tuple(sorted(lcount.items())), []).append(word)

collected_hashes = {}
# loop over every possible collection of word count groups
for i, groups in enumerate(combinations_with_replacement(by_count, 4)):
    if i % 10000 == 0:
        print(i, groups)

    # check to see whether the letter set sums appropriately
    total_count = sum((Counter(dict(group)) for group in groups), Counter())
    if total_count != target_counts:
        continue

    # the sums are right! loop over every word draw; for simplicity
    # we won't worry about duplicate word draws, we'll just skip
    # them if we see them
    for choices in product(*(by_count[group] for group in groups)):
        if len(set(choices)) != len(choices):
            # skip duplicate words
            continue
        for perm in permutations(choices):
            joined = ' '.join(perm)
            hashed = hashlib.md5(joined.encode("utf-8")).hexdigest()
            collected_hashes.setdefault(hashed, set()).add(joined)

We can check that each password collected this way does indeed have the right target letter counts:

In [30]: c = Counter('barbed badlands saddlebag skidded')

In [31]: c['a'], c['b'], c['d']
Out[31]: (5, 4, 8)
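
The same check can be wrapped in a small helper so that it can be run over every collected phrase. This is only a sketch; `has_target_counts` is a hypothetical name, and the phrase and target counts are the ones shown above.

```python
def has_target_counts(phrase, target_counts):
    """Return True if `phrase` contains exactly the required number of each letter."""
    return all(phrase.count(letter) == n for letter, n in target_counts.items())

target_counts = {"a": 5, "b": 4, "d": 8}

print(has_target_counts("barbed badlands saddlebag skidded", target_counts))  # True
print(has_target_counts("barbed badlands", target_counts))                    # False
```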

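The number of permutations of 4 words drawn from a set of 2000 is huge, just under 16 trillion. How long do you expect it to take?

Yes, the number of permutations will be huge. I am new to Python and am trying to optimize my script for better performance.

What you do inside the loop will be a much bigger factor than the permutation generation.

You need to change your algorithm. Nothing you do in Python, C, or any other language will make this problem tractable.

If there are only one or two matching strings, there is absolutely no need to enumerate every possible permutation and check it for a match! Please put in the rest of your code and add some explanation; there must be a better way to do this.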