Python: performance of my function that removes dictionary keys which are substrings of other keys

python performance algorithm dictionary

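I am curious why removing one line from my code gives a significant performance boost. The function takes a dictionary and removes every key that is a substring of another key.

The line that slows the code down is:

if sub in reduced_dict and sub2 in reduced_dict:

Here is my function: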

from collections import defaultdict
import time

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    # Group the keys by length: {length: [keys of that length]}
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:  # only compare against strictly longer keys
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict: # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print(time.time() - start_time)
    return reduced_dict

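The function checks whether sub is in sub2 many times. I assumed that checking whether a comparison had already been made would save me time. That does not seem to be the case. Why does a constant-time dictionary lookup slow me down?

I am a beginner, so I am interested in the concepts.

I also tested whether the line in question ever evaluates to False, and it appears that it does. I tested this with the following: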

from collections import defaultdict
import time

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    # Group the keys by length: {length: [keys of that length]}
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:  # only compare against strictly longer keys
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print('not present') # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print(time.time() - start_time)
    return reduced_dict

For 14,805 keys in the function's input dictionary:

19.6360001564 sec. without the line
33.1449999809 sec. with the line
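One way to see why the extra membership test can cost more than it saves is to time the two operations in isolation. The following is a minimal sketch using the standard timeit module; the sample keys, the names membership_check and substring_check, and the repeat count are assumptions made for illustration, not data from the question. For short strings the substring test is itself very cheap, so guarding it with two hash lookups on every inner iteration adds work instead of removing it.

```python
import timeit

# Hypothetical sample data: many short keys, similar in spirit to a word list.
keys = ["key%d" % i for i in range(10000)]
reduced_dict = dict.fromkeys(keys, 0)
sub, sub2 = "key12", "key1234"

def membership_check():
    # Two hash lookups, as in the line under discussion.
    return sub in reduced_dict and sub2 in reduced_dict

def substring_check():
    # One substring search inside a short string.
    return sub in sub2

n = 200000  # number of repetitions per measurement
t_membership = timeit.timeit(membership_check, number=n)
t_substring = timeit.timeit(substring_check, number=n)
print("membership: %.4f s, substring: %.4f s" % (t_membership, t_substring))
```

The absolute numbers depend on the machine and the Python version, so only the relative size of the two timings is meaningful.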

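Below are 3 example dictionaries.

For the largest example dictionary, I plotted time in seconds (Y) against input size in number of keys (X) for the first 14,000 keys. All of these functions appear to have exponential complexity.

John Zwinck: answer to this question
Matt: my algorithm without the dictionary comparison
My first attempt at solving this problem. It took 76 seconds
Matt compare: the algorithm in this question, with the dict comparison line
Answer to this question: algorithms 1 and 2, in order
georg: from a related question I asked

The accepted answer runs in apparently linear time.

I was surprised to find that there is a magic ratio of input size at which the run time of dict lookup == string search.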

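You create len_dict, but even though it groups keys of the same size, you still have to traverse everything multiple times to do the comparisons. Your basic plan is right: sort by size and only compare against keys of the same size or larger. There are other ways to do that, though. My version just creates a regular list sorted by key size and then iterates over it backwards, so that I can trim the dict as I go. I am curious how its execution time compares with yours. It finished your small dict example in 0.049 seconds.

(I hope it actually works!)

Edit

Not unpacking k_fwd, v_fwd gave a significant speed-up (after running it a couple more times, it was not really a speed-up. Something else must have been keeping my PC busy for a while).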
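The answer's own code is not reproduced here. As a rough illustration of the idea it describes (sort the keys by length, walk the list from the longest key down, and trim the dict on the way), here is a minimal sketch. The function name reduced_sorted and the survivors list are assumptions made for this example and are not taken from the answer.

```python
def reduced_sorted(dictionary):
    """Return a copy of dictionary without keys that are substrings of other keys."""
    result = dictionary.copy()
    keys = sorted(dictionary, key=len)  # shortest first
    survivors = []  # keys kept so far; each is at least as long as the current key
    for key in reversed(keys):  # walk backwards: longest key first
        # Distinct keys of equal length can never contain each other, so a hit
        # here always means that key sits inside a strictly longer key.
        if any(key in longer for longer in survivors):
            del result[key]  # trim the dict while iterating over the list
        else:
            survivors.append(key)
    return result

d = {"a": 1, "ab": 2, "abc": 3, "xyz": 4, "yz": 5, "q": 6}
print(reduced_sorted(d))
```

Comparing only against the surviving keys is enough: if a key is a substring of a removed key, it is also a substring of the longer key that caused that removal.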

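I would do it a bit differently. Here is a generator function that yields only the "good" keys. This avoids creating a dict that may then be largely destroyed key by key. I also use just two levels of "for" loops, plus some simple optimizations to find matches sooner and to avoid searching for impossible matches.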

def reduced_keys(dictionary):
    # longest first for max hit chance
    keys = sorted(dictionary, key=len, reverse=True)
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1): # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1

For the sample corpus, or any corpus where most keys are short, testing all possible subkeys is much faster:

def reduced(dictionary):
    keys = set(dictionary)
    subkeys = set()
    for key in keys:
        # Try every proper substring of key: length n, starting at index i
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)

    return {k: v
            for (k, v) in dictionary.items()
            if k not in subkeys}

On my system (i7-3720QM 2.6GHz) this takes about 0.2 seconds.
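To make the reasoning behind this approach concrete: a key of length m has only (m - 1)(m + 2) / 2 proper substrings, so for short keys the total number of set lookups grows with the number of keys, not with the number of key pairs. The sketch below enumerates those substrings for one key and compares the two counts; the helper names proper_substrings and count_lookups and the sample keys are assumptions made for illustration.

```python
def proper_substrings(key):
    """Yield every non-empty substring of key that is shorter than key."""
    for n in range(1, len(key)):          # substring length
        for i in range(len(key) + 1 - n): # start index
            yield key[i:i + n]

def count_lookups(keys):
    """Number of set lookups the all-substrings approach performs."""
    return sum(1 for key in keys for _ in proper_substrings(key))

print(list(proper_substrings("abc")))  # ['a', 'b', 'c', 'ab', 'bc']

# Hypothetical corpus of 1000 distinct keys, each 7 characters long.
sample_keys = ["key%04d" % i for i in range(1000)]
pairs = len(sample_keys) * (len(sample_keys) - 1) // 2
print("set lookups:", count_lookups(sample_keys))
print("key pairs:  ", pairs)
```

With longer keys the balance shifts, because the substring count grows quadratically with key length.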

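Could you provide the sample dicts you used to test this function? Also, measuring this with time.time() is usually not accurate enough. You should use the timeit module instead.

The first is longer, the second is shorter.

Your code is probably slow mainly because of the quadruple-nested "for" loops.

@John Zwinck The code above runs much faster than my solution from a previous question, where I tried to solve the same problem. I will try to deal with the nesting, but that is a separate issue.

You describe an algorithm that operates on keys, but your code operates on values. Is that just a typo?

Your function is faster. For 14805 keys, my function without the problem line ran in 17.75 seconds. Yours ran in 12.3139998913 seconds. Interestingly, your function returned 7087 keys and mine 7086. I like your approach. Do you know why the dictionary lookup lengthens my function's run time?

@mattkaeo Dictionary lookups are not free, they are just (pseudo) constant time. Because of the way your code is structured, you are doing millions of lookups.

1. Your 2 for loops over len_dict.items() mean that you rebuild the items in the second loop len(len_dict) times and then immediately discard most of them in the if. The remaining dict lookups are fairly expensive compared with index lookups on a list. In fact, I thought I would beat you by more. Now I am puzzled...

@tdelaney That is interesting. I did not know that a dictionary lookup could be more expensive than an index lookup in a list. Here is the larger, 14805-key dictionary I used for the experiment.

Oops, I read the results from a dict with 22056 items that I had generated. Your 14805 set takes 4 seconds. Interesting how the larger set increased the time: not exponential growth!

For my 14805-key dictionary (粘贴.it/14805keys), returning the same 7086 keys as my function: yours = 27.7 s, mine = 17.7 s. I will try timing it another way. What is the most expensive part of your generator? Is it the length comparison or populating the dictionary?

Interesting. We could cache the length of key1 outside the inner loop. If you want more speed (while still being usable from Python), this algorithm should be easy to port to C. I would be curious how fast it is if you sort in non-reverse order and drop the key length check entirely.

This is the fastest algorithm. I am plotting it for over 103,000 keys. I will add that graph to the question. There seems to be a magic ratio of input size at which dict lookup == string search.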
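The suggestion to cache the length of key1 outside the inner loop could look roughly like the sketch below. It is a variant written for illustration, not code posted in the thread; the name reduced_keys_cached, the variable len1, and the sample dict d are assumptions.

```python
def reduced_keys_cached(dictionary):
    """Yield the keys that are not substrings of any longer key."""
    keys = sorted(dictionary, key=len, reverse=True)  # longest first
    for key1 in keys:
        len1 = len(key1)  # computed once per outer iteration
        is_substring = False
        for key2 in keys:
            if len(key2) <= len1:  # no remaining key is long enough to contain key1
                break
            if key1 in key2:
                is_substring = True
                break
        if not is_substring:
            yield key1

d = {"a": 1, "ab": 2, "abc": 3, "xyz": 4}
print({key: d[key] for key in reduced_keys_cached(d)})  # {'abc': 3, 'xyz': 4}
```

Whether the cached length gives a measurable gain would need to be checked with timeit on real data, since len() on a string is already a cheap call.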

To build the reduced dictionary from the reduced_keys generator, use it like this:

{key: d[key] for key in reduced_keys(d)}
