All-against-all comparison of two lists in Python

python performance

I'm struggling with some performance issues. The task at hand is to extract a similarity value between two strings. For this I use fuzzywuzzy:

from fuzzywuzzy import fuzz

# fuzz.ratio returns an integer similarity score between 0 and 100.
print(fuzz.ratio("string one", "string two"))
print(fuzz.ratio("string one", "string two which is significantly different"))

# Output:
# result1 80
# result2 38

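That part works fine. The problem I'm facing is that I have two lists, one with 1500 rows and the other with several thousand. I need to compare every element of the first with every element of the second. A simple for loop nested inside another for loop takes a very long time to compute.

If anyone has a suggestion for how to speed this up, I would greatly appreciate it.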

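If you need to count the number of times each statement appears, then no, I know of no way to get a huge speedup over the n^2 operations needed to compare the elements of each list. You could avoid some of the string matching by using length to rule out pairs that cannot possibly match, but you would still have the nested for loops. You would probably spend more time optimizing it than the processing time it would save you.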
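To make the length-based pruning idea concrete, here is a minimal sketch. It rests on several assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, it scores on a 0 to 1 scale rather than 0 to 100, and the function name `count_matches` and the `threshold` parameter are made up for illustration. A ratio of the form 2 * matches / (len(a) + len(b)) can never exceed 2 * min(len(a), len(b)) / (len(a) + len(b)), so pairs whose bound falls below the threshold can be skipped without running the expensive comparison.

```python
from difflib import SequenceMatcher


def count_matches(statements, tweets, threshold=0.8):
    """For each statement, count the tweets whose similarity ratio reaches threshold."""
    counts = {}
    for s in statements:
        n = 0
        for t in tweets:
            total = len(s) + len(t)
            # Upper bound on the ratio that depends only on the two lengths.
            upper = 2.0 * min(len(s), len(t)) / total if total else 1.0
            if upper < threshold:
                continue  # this pair cannot reach the threshold, so skip it
            if SequenceMatcher(None, s, t).ratio() >= threshold:
                n += 1
        counts[s] = n
    return counts
```

The loop is still nested, so the worst case remains O(n^2) comparisons. The filter only helps when many pairs differ a lot in length.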

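I put something together for you myself (Python 2.7):

The best solution I can think of is to use Streams to parallelize the essentially unavoidable O(n^2) solution.

Using that framework, you would be able to write a single-threaded kernel similar to this: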

from fuzzywuzzy import fuzz

def matchStatements(tweet, statements):
    # Score a single tweet against every statement; returns a list of 0-100 ratios.
    results = []
    for s in statements:
        r = fuzz.ratio(tweet, s)
        results.append(r)
    return results

Then parallelize it using a setup similar to this:

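# Assumes the streamsx.topology package (the Streams Python topology framework).
from streamsx.topology.topology import Topology

def main():
    topo = Topology("tweet_compare")
    # getTweets is a callable that produces the tweets (not shown here).
    source = topo.source(getTweets)
    cpuCores = 4
    # Split the stream into cpuCores parallel channels and score each tweet.
    # transform passes each tuple as the only argument, so `statements` would
    # need to be bound to matchStatements first (for example with functools.partial).
    match = source.parallel(cpuCores).transform(matchStatements)
    end = match.end_parallel()
    end.sink(print)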

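This multithreaded processing speeds things up considerably, while saving you the work of implementing the multithreading details yourself (which is the main advantage of Streams).

The idea is that each tweet is a Streams tuple to be processed across multiple processing elements.

The documentation for the Streams Python topology framework describes this, and covers the parallel operator in particular.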
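For readers without a Streams installation, the same fan-out idea can be sketched with the standard library alone. This is not the Streams approach described above but a swapped-in technique: it uses `multiprocessing.Pool` to spread tweets over worker processes, and `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`. The names `match_statements`, `score_all`, and `workers`, and the `chunksize` value, are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def match_statements(tweet, statements):
    """Score one tweet against every statement on a 0-100 scale."""
    return [int(round(100 * SequenceMatcher(None, tweet, s).ratio()))
            for s in statements]


def score_all(tweets, statements, workers=1):
    """Return one row of scores per tweet, optionally using worker processes."""
    # Bind the statements so each worker call only needs the tweet.
    kernel = partial(match_statements, statements=statements)
    if workers <= 1:
        return [kernel(t) for t in tweets]
    with Pool(workers) as pool:
        # chunksize batches tweets to reduce inter-process overhead.
        return pool.map(kernel, tweets, chunksize=64)
```

When calling `score_all(tweets, statements, workers=4)` from a script, put the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The total work is still O(n^2); it is only divided among the available cores.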
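You can use column_name.tolist() to convert a column to a list and assign it to a variable. There is a Python package called two list similarity that compares the lists from two columns and computes a score.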
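If you must compare each element against every other element one by one, then the expensive O(n^2) double for loop you are worried about cannot be avoided. However, if you tell us more about the problem you are trying to solve, the types of elements involved, and why you think every element must be compared, we may be able to help you optimize.

The idea is to count how many times each of those 1500 statements appears in the list of tweets (which contains several thousand entries).

I really appreciate the effort, but this is not what I was looking for. Suppose you have two strings, "similar" and "different similar" (with deliberate misspellings). Your example would not even return any output, whereas fuzzywuzzy reports a 50% similarity.

@VnC I think the second algorithm will meet your criteria.

I think it is possible, @jcolemang. Check my solution:

@turkus What I meant in my answer is that you cannot improve the time complexity beyond n^2 (I should have worded that better). I believe your answer shows how to improve the individual comparisons, not the algorithm for matching the two lists together.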