
Python: Generating a list of distinct (distant, by edit distance) words by filtering


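I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D of each other.

I do not need a unique solution, and it does not have to be exactly optimal, but it should be reasonably quick (in Python) and should not discard too many entries.

How can I achieve this? Thanks.

Edit: to be clear, I can google for a Python routine that measures edit distance. The problem is how to do this efficiently and, perhaps, in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all the words and then pruning it?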

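You can use a BK-tree and, before adding each item, check that it is not within distance D of any item already in the tree (thanks to @DietrichEpp for this idea in the comments).

You can use an existing BK-tree recipe (any similar recipe is easy to modify). Only two changes are needed. Change the line: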

def __init__(self, items, distance, usegc=False):

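so that the constructor also takes a threshold parameter after distance (the call further down passes it as the third positional argument), and change the line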

        if el not in self.nodes: # do not add duplicates

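to a condition that also skips the element when some item already in the tree lies within threshold of it.

This ensures that no duplicates or near-duplicates are added. The code to remove them from a list is then: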

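from Levenshtein import distance
from bktree import BKtree  # the modified recipe, saved locally as bktree.py

def remove_duplicates(lst, threshold):
    # Items within `threshold` of an already inserted item are skipped on insertion.
    tr = BKtree(iter(lst), distance, threshold)
    return list(tr.nodes.keys())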

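Note that this relies on the distance function from the Levenshtein package (imported above), which is much faster than the one provided with the BK-tree recipe.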
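For readers without the recipe at hand, here is a minimal, self-contained sketch of the same idea. It is not the recipe referred to above: FilteringBKTree, filter_words and the pure-Python levenshtein function are hypothetical names introduced only for illustration, and the sketch assumes that "too similar" means a distance less than or equal to the threshold, so kept words end up strictly more than threshold apart.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # skip a char of a
                               current[j - 1] + 1,             # skip a char of b
                               previous[j - 1] + (ca != cb)))  # match or substitute
        previous = current
    return previous[-1]


class FilteringBKTree:
    """BK-tree that refuses items within `threshold` of a stored item (illustrative)."""

    def __init__(self, distance, threshold):
        self.distance = distance
        self.threshold = threshold
        self.root = None   # each node is a tuple: (item, {edge_distance: child_node})
        self.items = []    # accepted items, in insertion order

    def has_neighbor(self, item):
        """Return True if some stored item is within `threshold` of `item`."""
        if self.root is None:
            return False
        stack = [self.root]
        while stack:
            node_item, children = stack.pop()
            d = self.distance(item, node_item)
            if d <= self.threshold:
                return True
            # Triangle inequality: a match can only sit under an edge in [d - t, d + t].
            for edge, child in children.items():
                if d - self.threshold <= edge <= d + self.threshold:
                    stack.append(child)
        return False

    def add(self, item):
        """Insert `item` unless it is too close to a stored item; report success."""
        if self.has_neighbor(item):
            return False
        self.items.append(item)
        if self.root is None:
            self.root = (item, {})
            return True
        node = self.root
        while True:
            node_item, children = node
            d = self.distance(item, node_item)
            if d not in children:
                children[d] = (item, {})
                return True
            node = children[d]


def filter_words(words, threshold, distance=levenshtein):
    """Greedily keep words whose distance to every kept word exceeds `threshold`."""
    tree = FilteringBKTree(distance, threshold)
    for word in words:
        tree.add(word)
    return tree.items
```

The result depends on input order, since earlier words win. Swapping in the C implementation from the Levenshtein package via the distance argument should speed this up considerably.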

Finally, I set up a performance test with an increasing number of words (sampled at random from /usr/share/dict/words) and plotted the time each run took:

import random
import time
from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    word_list = [l[:-1] for l in inf]

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()

def time_remove_duplicates(n, threshold):
    """Test using n words"""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt

plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")

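I have not confirmed this mathematically, but I do not think the running time is quadratic. I think it may actually be n log n, which would make sense if insertion into a BK-tree is a logarithmic-time operation. Most notably, it runs quite quickly for fewer than 5000 strings, which is hopefully what the OP needs (15000 is still reasonable, which it would not be for a traditional nested for-loop solution).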
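One rough way to check a guess like this is to fit a straight line to log(time) against log(n): the slope approximates the exponent, so quadratic growth gives about 2, while n log n gives a value only slightly above 1 over this range. The helper below is a sketch that was not part of the original test; scaling_exponent is a hypothetical name, and it assumes all timings are positive.

```python
import math


def scaling_exponent(ns, timings):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

With the variables from the test script, scaling_exponent(ns, timings) would give the estimate. With only eight noisy measurements it is an indication at best.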

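Your trie idea is definitely an interesting one. There is a nice setup for fast edit-distance calculations in a trie, which would certainly pay off if you needed to scale the word list to millions of entries rather than a thousand, which is quite small by corpus-linguistics standards.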
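As an illustration of the kind of setup meant here, the sketch below walks a trie while carrying one row of the Levenshtein table per node, so words that share a prefix share the work. It is an assumption about the general technique, not the specific setup the answer refers to; TrieNode, trie_insert and trie_search are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set only on nodes that end a stored word


def trie_insert(root, word):
    """Add `word` to the trie rooted at `root`."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def trie_search(root, query, max_dist):
    """Return (word, distance) pairs for stored words within `max_dist` of `query`."""
    first_row = list(range(len(query) + 1))  # distances from the empty prefix
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return results


def _walk(node, ch, query, previous_row, max_dist, results):
    # Build the table row for the trie prefix that ends at this node.
    current_row = [previous_row[0] + 1]
    for col in range(1, len(query) + 1):
        current_row.append(min(
            current_row[col - 1] + 1,                        # skip a query char
            previous_row[col] + 1,                           # skip the trie char
            previous_row[col - 1] + (query[col - 1] != ch),  # match or substitute
        ))
    if node.word is not None and current_row[-1] <= max_dist:
        results.append((node.word, current_row[-1]))
    # Row minima never decrease with depth, so a row entirely above max_dist is a dead end.
    if min(current_row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, current_row, max_dist, results)
```

For the filtering problem, a word would be inserted only when trie_search returns nothing for it at the chosen distance.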

Good luck, this sounds like a fun way of framing the problem!

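Tries will not help, and neither will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.

But the real problem here is the ill-specified requirement of "efficient". How fast is "efficient"?

I selected 10,000 words uniformly at random from the "American English" dictionary on my hard disk and looked for a set with a distance of 5, which yielded about 2,000 entries.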

real    0m2.558s
user    0m2.404s
sys     0m0.012s

So the question is, "How efficient is efficient enough?" Since you did not specify your requirements, it is hard for me to know whether this algorithm works for you.

The rabbit hole

If you want something faster, here is how I would do it.

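Create a VP-tree, a BK-tree, or another suitable spatial index. For each word in the corpus, insert the word into the tree if it is at least the required minimum distance from every word already in the index. Spatial indexes are designed specifically to support this kind of query.

At the end, you will have a tree whose nodes are all at least the desired minimum distance apart.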
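Whichever index is used, the result can be checked on a small sample by brute force. The sketch below is an illustrative check and not part of the answer; closest_pair and edit_distance are hypothetical names, and the quadratic loop is only meant for outputs small enough to compare pairwise.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def closest_pair(words, distance=edit_distance):
    """Return (d, a, b) for the closest pair, or None for fewer than two words."""
    best = None
    for a, b in combinations(words, 2):
        d = distance(a, b)
        if best is None or d < best[0]:
            best = (d, a, b)
    return best
```

If the first element of the returned tuple is at least the required minimum distance, the filtered list satisfies the constraint.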

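Take a look at the module.

How does this solve the O(n^2) comparisons that are the fundamental problem here?

The problem is that a BK-tree is not suited to this, because it identifies items close to a newly inserted node, not items in the tree that are close to each other.

@DavidRobinson: A BK-tree will do just fine. For each word, compute the minimum distance from that word to any word in the tree. If it is at least D, insert the word into the tree. When you are done, you have a BK-tree whose nodes are at least D apart from each other.

@DietrichEpp: Thanks for the suggestion; the implementation below works great!

Thanks, but this is O(n^2) (I think?). I was hoping for something smarter.

@andrewcooke: I would guess it is still faster than the nested for-loop approach, but yes.

Oh, thanks, that looks good. (The comment about O(n^2) was about a previous version of this post.) Yes, I think you may be right that I am worrying too much about speed (a simple solution would be enough). But then I became interested in the problem for its own sake...

I think the issue here is that the tree saves time when computing distances between words that are spelled similarly. But what you actually need is to group words that sound alike, which is not quite the same problem. That is my understanding of Dietrich's comment above.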


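The simple greedy filter timed in the answer above:

import Levenshtein

def simple(corpus, distance):
    """Greedily keep words so that every kept pair is at least `distance` apart.

    `distance` must be at least 1, otherwise the loop never terminates.
    """
    words = []
    while corpus:
        # Keep the first remaining word as a center ...
        center = corpus[0]
        words.append(center)
        # ... and drop every word closer than `distance` to it (the center included).
        corpus = [word for word in corpus
                  if Levenshtein.distance(center, word) >= distance]
    return words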
