Python: NaN when computing Mahalanobis distance with SciPy


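When I try to compute the Mahalanobis distance in Python with SciPy's pdist, I get some NaN entries in the result. Do you have any idea why this happens? My data.shape = (181, 1500).

I also tried standardizing the centered data: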

data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform( pdist(data_standard, 'mahalanobis' ) )
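This still gives NaNs. The input is not corrupted, and other distances (such as the correlation distance) compute fine. For some reason, when I reduce the number of features, I no longer get NaNs. For example, the following calls do not produce any NaN: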

D = squareform( pdist(data_centered[:,:200], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:480], 'mahalanobis' ) )
while these do produce NaNs:

D = squareform( pdist(data_centered[:,:300], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:600], 'mahalanobis' ) )

Any clue? Is this expected behavior when some condition on the input is not satisfied?

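You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code doesn't check this, and blindly computes the "inverse" of the covariance matrix. Because this numerically computed inverse is basically garbage, the product (x-y)*inv(V)*(x-y) (where x and y are observations) might turn out to be negative. The square root of that value then yields nan.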
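To see the singularity directly, here is a minimal sketch on made-up random data. The sizes (10 observations, 30 features), the seed, and the 1e-10 relative threshold are my assumptions, not values from the question. Centering costs one degree of freedom, so the covariance of n observations can have rank at most n - 1:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility

# Illustrative sizes (assumed): fewer observations than features.
n_obs, n_features = 10, 30
X = rng.normal(size=(n_obs, n_features))

# Covariance with features as variables, as the scipy code forms it.
V = np.cov(X.T)

# Count singular values that are not negligible relative to the largest one.
# The 1e-10 relative threshold is an assumed cutoff for "numerically zero".
s = np.linalg.svd(V, compute_uv=False)
rank = int((s > 1e-10 * s.max()).sum())

print("V shape:", V.shape)
print("numerical rank:", rank, "(upper bound n_obs - 1 =", n_obs - 1, ")")
```

If the numerical rank is smaller than the number of features, V has no true inverse and whatever inv returns should not be trusted.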

For example, this array also results in nan:

In [265]: x
Out[265]: 
array([[-1. ,  0.5,  1. ,  2. ,  2. ],
       [ 2. ,  1. ,  2.5, -1.5,  1. ],
       [ 1.5, -0.5,  1. ,  2. ,  2.5]])

In [266]: squareform(pdist(x, 'mahalanobis'))
Out[266]: 
array([[ 0.        ,         nan,  1.90394328],
       [        nan,  0.        ,         nan],
       [ 1.90394328,         nan,  0.        ]])
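Here's the Mahalanobis calculation done "by hand", with V the covariance matrix np.cov(x.T) that the scipy code computes.

In theory, V is singular; the following value is effectively 0: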

In [280]: np.linalg.det(V)
Out[280]: -2.968550671342364e-47
But inv doesn't detect the problem, and returns an inverse:

In [281]: VI = np.linalg.inv(V)
Let's compute the distance between x[0] and x[2], and verify that using VI gives the same non-nan value (1.9039) that pdist returned:

In [295]: delta = x[0] - x[2]

In [296]: np.dot(np.dot(delta, VI), delta)
Out[296]: 3.625

In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
Out[297]: 1.9039432764659772
Here's what happens when we try to compute the distance between x[0] and x[1]:

In [300]: delta = x[0] - x[1]

In [301]: np.dot(np.dot(delta, VI), delta)
Out[301]: -1.75
The square root of that value then gives nan.


In scipy 0.16 (to be released in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:

In [4]: x = array([[-1. ,  0.5,  1. ,  2. ,  2. ],
   ...:        [ 2. ,  1. ,  2.5, -1.5,  1. ],
   ...:        [ 1.5, -0.5,  1. ,  2. ,  2.5]])

In [5]: pdist(x, 'mahalanobis')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-a3453ff6fe48> in <module>()
----> 1 pdist(x, 'mahalanobis')

/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1298                                      "singular. For observations with %d "
   1299                                      "dimensions, at least %d observations "
-> 1300                                      "are required." % (m, n, n + 1))
   1301                 V = np.atleast_2d(np.cov(X.T))
   1302                 VI = _convert_to_double(np.linalg.inv(V).T.copy())

ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
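If collecting more observations or dropping features is not an option, one possible workaround (my suggestion, not something the scipy code above does) is to pass your own VI to pdist, built with the Moore-Penrose pseudo-inverse instead of inv. The sketch below reuses the example array. The rcond=1e-10 cutoff is an assumed tolerance, and the resulting distances only measure differences within the subspace spanned by the data, so treat them with care:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The example array from above: 3 observations, 5 features.
x = np.array([[-1.0,  0.5, 1.0,  2.0, 2.0],
              [ 2.0,  1.0, 2.5, -1.5, 1.0],
              [ 1.5, -0.5, 1.0,  2.0, 2.5]])

n_obs, n_features = x.shape
V = np.atleast_2d(np.cov(x.T))

if n_obs <= n_features:
    # V is singular: use the pseudo-inverse, discarding singular values
    # below an assumed relative cutoff of 1e-10.
    VI = np.linalg.pinv(V, rcond=1e-10)
else:
    VI = np.linalg.inv(V)

# Supplying VI explicitly means pdist does not invert the covariance itself.
D = squareform(pdist(x, 'mahalanobis', VI=VI))
print(D)
print("any nan:", np.isnan(D).any())
```

Because the pseudo-inverse of a symmetric positive semi-definite matrix is itself positive semi-definite, the quadratic form under the square root no longer goes negative.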
