Efficient and precise calculation of Euclidean distance in Python

python python-3.x

After some research online, I found several ways to calculate Euclidean distance in Python:

# 1
numpy.linalg.norm(a-b)

# 2
distance.euclidean(vector1, vector2)

# 3
sklearn.metrics.pairwise.euclidean_distances  

# 4
sqrt((xa-xb)**2 + (ya-yb)**2 + (za-zb)**2)

# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))

# 6
math.hypot(x, y)

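I would like to know whether anyone can offer insight into which of the above (or any other method I have not found) is considered best in terms of efficiency and precision. If anyone knows of resources that discuss the subject, that would be great too.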

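The context I am interested in is calculating the Euclidean distance between pairs of number tuples, for example the distance between (52, 106, 35, 12) and (33, 153, 75, 10).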

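As a general rule of thumb, stick to the scipy and numpy implementations wherever possible, because they are vectorized and much faster than native Python code. (The main reasons: they are implemented in C, and vectorization eliminates the type-checking overhead that a Python loop incurs.)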
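To make the vectorization point concrete, here is a minimal sketch. The array names points_a and points_b and the second row of data are made up for illustration. It computes several pairwise distances in one call through the axis argument of numpy.linalg.norm, rather than looping in Python:

```python
import numpy as np

# Hypothetical data: each row is one 4-element tuple; row i of points_a
# is paired with row i of points_b.
points_a = np.array([(52, 106, 35, 12), (0, 0, 0, 0)], dtype=float)
points_b = np.array([(33, 153, 75, 10), (1, 1, 1, 1)], dtype=float)

# axis=1 takes the norm of each row of the difference, so every distance
# is computed in a single vectorized call.
distances = np.linalg.norm(points_a - points_b, axis=1)
print(distances)
```

The first entry is the distance for the pair of tuples from the question.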

(Aside: my answer does not cover precision, but I think the same principle applies to precision as to efficiency.)

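As a bonus, I will add some information on how to profile your code in order to measure efficiency. If you are using the IPython interpreter, the secret is to use the %prun line magic.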

In [1]: import numpy

In [2]: from scipy.spatial import distance

In [3]: c1 = numpy.array((52, 106, 35, 12))

In [4]: c2 = numpy.array((33, 153, 75, 10))

In [5]: %prun distance.euclidean(c1, c2)
         35 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        6    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        4    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 distance.py:232(euclidean)
        2    0.000    0.000    0.000    0.000 distance.py:152(_validate_vector)
        2    0.000    0.000    0.000    0.000 shape_base.py:9(atleast_1d)
        1    0.000    0.000    0.000    0.000 misc.py:11(norm)
        1    0.000    0.000    0.000    0.000 function_base.py:605(asarray_chkfinite)
        2    0.000    0.000    0.000    0.000 numeric.py:476(asanyarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'squeeze' of 'numpy.ndarray' objects}


In [6]: %prun numpy.linalg.norm(c1 - c2)
         10 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

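What %prun does is tell you how long a function call takes to run, including a bit of trace to help you figure out where the bottleneck might be. In this case, both the scipy.spatial.distance.euclidean and numpy.linalg.norm implementations are very fast. If you define a function dist(vect1, vect2), you can profile it with the same IPython magic call. As another bonus, %prun also works inside Jupyter notebooks, and you can profile an entire code cell, rather than just one function, by making %%prun the first line of that cell.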
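Outside IPython, a similar report can be produced with the standard library's cProfile and pstats modules. The following is a minimal sketch, assuming a pure-Python dist(vect1, vect2) written here only for illustration; the loop count of 10000 is arbitrary and is there just so the timings are not all zero:

```python
import cProfile
import io
import math
import pstats


def dist(vect1, vect2):
    """Euclidean distance between two equal-length sequences (illustrative)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vect1, vect2)))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(10000):
    result = dist((52, 106, 35, 12), (33, 153, 75, 10))
profiler.disable()

# Sort by internal time ("tottime") and print the ten most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
report = stream.getvalue()
print(report)
```

Repeating the call many times matters here: a single call on 4-element vectors finishes faster than the profiler's displayed resolution, which is why every column in the output above reads 0.000.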

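Conclusion first: based on efficiency tests with timeit, the methods rank as follows, fastest first:

Method5 (zip, math.sqrt)
> Method1 (numpy.linalg.norm)
> Method2 (scipy.spatial.distance)
> Method3 (sklearn.metrics.pairwise.euclidean_distances)

I did not actually test your Method4, since it does not apply to the general case and is essentially equivalent to Method5.

As for the rest, quite surprisingly, Method5 is the fastest. Method1, which uses numpy and is, as expected, heavily optimized in C, comes second.

For scipy.spatial.distance, if you go straight to the function definition you will see that it actually uses numpy.linalg.norm, except that it validates the two input vectors before making the actual numpy.linalg.norm call. That is why it is slightly slower than numpy.linalg.norm.

Finally, for sklearn, according to ...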
import math
import timeit

import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances

# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1 - v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

# 3
def eudis3(v1, v2):
    # euclidean_distances expects 2-D input (one row per sample);
    # recent scikit-learn versions reject 1-D arrays
    return euclidean_distances(v1.reshape(1, -1), v2.reshape(1, -1))

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)

def wrapper(func, *args, **kwargs):
    # Bind the arguments so that timeit can call the function with none
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)

t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)

# Average time in seconds over the three repeats
print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))

t1:  0.654838958307
t2:  1.53977598714
t3:  6.7898791732
t5:  0.422228400305

In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238

In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238

In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])

In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238
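One plausible reading of these timings (an assumption, not something measured in this answer) is that for 4-element vectors the fixed per-call overhead of NumPy dominates, so the ranking may change as the vectors grow. Below is a sketch for checking this on your own machine; the helper names dist_python and dist_numpy, the vector sizes, and the repeat counts are all illustrative choices:

```python
import math
import timeit

import numpy as np


def dist_python(v1, v2):
    # Same approach as Method5: zip plus math.sqrt
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def dist_numpy(v1, v2):
    # Same approach as Method1: numpy.linalg.norm
    return np.linalg.norm(v1 - v2)


rng = np.random.default_rng(0)
results = {}
for n in (4, 10000):
    v1, v2 = rng.random(n), rng.random(n)
    # Give the pure-Python version plain lists so it works on Python floats
    list1, list2 = v1.tolist(), v2.tolist()
    t_py = min(timeit.repeat(lambda: dist_python(list1, list2), repeat=3, number=200))
    t_np = min(timeit.repeat(lambda: dist_numpy(v1, v2), repeat=3, number=200))
    results[n] = (dist_python(list1, list2), float(dist_numpy(v1, v2)))
    print(f"n={n}: pure Python {t_py:.5f}s, NumPy {t_np:.5f}s")
```

No expected timings are given here because they depend on the machine and on the library versions installed.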

from math import hypot

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)
    return zip(a, b)

a = (52, 106, 35, 12)
b = (33, 153, 75, 10)

dist = [hypot(p2[0]-p1[0], p2[1]-p1[1])
        for p1, p2 in pairwise(tuple(zip(a, b)))]
print(dist)  # -> [131.59027319676787, 105.47511554864494, 68.94925670375281]

import numpy as np

a = np.array([3, 0])
b = np.array([0, 4])

c = np.sqrt(np.sum(((a - b) ** 2)))
# c == 5.0
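On the precision side of the question, one concrete difference between the approaches is how they handle very large coordinates. The sketch below assumes Python 3.8 or later, where math.dist is available, and uses made-up extreme values. Squaring the differences directly overflows to infinity, while in CPython math.dist appears to scale its intermediate values and returns a finite result:

```python
import math

# Hypothetical extreme coordinates, chosen so that squaring overflows a float
a = (1e200, 0.0)
b = (0.0, 1e200)

# Naive formula: each squared difference is 1e400, beyond the float range,
# so the product becomes inf and so does the final result.
naive = math.sqrt(sum((x - y) * (x - y) for x, y in zip(a, b)))

# math.dist (Python 3.8+) takes the two points directly.
robust = math.dist(a, b)

print(naive, robust)
```

For ordinary inputs such as the tuples in the question, both approaches agree to the last printed digit, so this only matters near the limits of the float range.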