Python: condensed matrix function to find pairs

python algorithm math statistics

For a set of observations:

[a1,a2,a3,a4,a5]

their pairwise distances

d=[[0,a12,a13,a14,a15]
   [a21,0,a23,a24,a25]
   [a31,a32,0,a34,a35]
   [a41,a42,a43,0,a45]
   [a51,a52,a53,a54,0]]

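are given in condensed matrix form (the upper triangle of the above, as computed by scipy.spatial.distance.pdist):

c=[a12,a13,a14,a15,a23,a24,a25,a34,a35,a45]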

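The question is: given an index into the condensed matrix, is there a function f (preferably in Python) that quickly returns the two observations that were used to compute that entry?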

f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...
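
For reference, the ordering described here can be reproduced without SciPy: pdist lists the pairs in the same order as itertools.combinations applied to the observation numbers. The sketch below only illustrates the desired mapping. The name `condensed_pairs` is made up for this example, and the 1-based numbering follows the question rather than NumPy's 0-based convention.

```python
from itertools import combinations

def condensed_pairs(n):
    """Return the (row, col) pair for every condensed index, 1-based as in the question."""
    return list(combinations(range(1, n + 1), 2))

pairs = condensed_pairs(5)
print(pairs[0], pairs[5], pairs[9])  # (1, 2) (2, 4) (4, 5)
```

Building the full list costs O(n^2) memory, so this is only a reference for checking faster approaches.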

I tried some solutions, but none of them is worth mentioning :(

Clearly, the function f you are searching for needs a second argument: the dimension of the matrix, which in your case is 5.

First attempt:

def f(dim,i): 
  d = dim-1 ; s = d
  while i<s: 
    s+=d ; d-=1
  return (dim-d, i-s+d)
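
As written, the loop condition and the update order make this attempt return wrong pairs; for example, f(5, 1) evaluates to (11, -4). A minimal corrected sketch under the question's 1-based convention might look like the following. The name `pair_from_index` and the range check are additions for illustration and are not part of the original answer.

```python
def pair_from_index(dim, i):
    """Map a 0-based condensed index i to a 1-based (row, col) pair."""
    if not 0 <= i < dim * (dim - 1) // 2:
        raise IndexError("condensed index out of range")
    d = dim - 1      # number of entries in the current row of the upper triangle
    s = d            # number of condensed entries up to and including this row
    while i >= s:    # the index lies beyond the current row: move one row down
        d -= 1
        s += d
    row = dim - d
    col = row + 1 + (i - (s - d))   # offset within the row, counted from the diagonal
    return (row, col)

print(pair_from_index(5, 0), pair_from_index(5, 5), pair_from_index(5, 9))
# (1, 2) (2, 4) (4, 5)
```

The loop visits at most dim - 1 rows, so a lookup costs O(dim) time and no extra memory.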

You may find triu_indices useful. For example:
In []: ti= triu_indices(5, 1)
In []: r, c= ti[0][5], ti[1][5]
In []: r, c
Out[]: (1, 3)

Note that the indices start from 0. You can adjust this as needed, for example:
In []: def f(n, c):
   ..:     n= int(ceil(sqrt(2* n)))
   ..:     ti= triu_indices(n, 1)
   ..:     return ti[0][c]+ 1, ti[1][c]+ 1
   ..:
In []: f(len(c), 5)
Out[]: (2, 4)
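
Passing the matrix dimension explicitly avoids having to infer it from the condensed length. The following is a small self-contained sketch along those lines, assuming NumPy is installed; `pairs_from_condensed` is an illustrative name, not a NumPy function.

```python
import numpy as np

def pairs_from_condensed(n, k):
    """Return 0-based (rows, cols) for condensed index k (scalar or array) of an n x n matrix."""
    rows, cols = np.triu_indices(n, 1)   # upper triangle, excluding the diagonal
    return rows[k], cols[k]

r, c = pairs_from_condensed(5, [0, 5, 9])
print(r + 1, c + 1)  # 1-based, as in the question: [1 2 4] [2 4 5]
```

Note that triu_indices materialises two index arrays of length n*(n-1)/2, so the memory cost is on the same order as the condensed matrix itself.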

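This is an addition to the answer provided by phynfo and to your comment. Inferring the dimension of the matrix from the length of the condensed matrix does not feel like a clean design to me. That said, here is how you can compute it: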
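from math import sqrt, ceil

for i in range(1, 10):
    # length of the condensed array for an (i+1) x (i+1) matrix
    thelen = (i * (i + 1)) // 2
    # 2*thelen + ceil(sqrt(2*thelen)) equals (i+1)**2, a perfect square
    thedim = sqrt(2 * thelen + ceil(sqrt(2 * thelen)))
    print("compressed array of length %d has dimension %d" % (thelen, thedim))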

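The argument of the outer square root should always be a perfect square, but sqrt returns a floating-point number, so some care is needed when using the result.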
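
One way to sidestep the floating-point concern is to stay in integer arithmetic. The sketch below assumes Python 3.8 or later (for math.isqrt) and solves d*(d-1)/2 == m directly; the function name `dim_from_condensed_length` is hypothetical.

```python
import math

def dim_from_condensed_length(m):
    """Return d such that d * (d - 1) / 2 == m, using integer arithmetic only."""
    d = (1 + math.isqrt(1 + 8 * m)) // 2
    if d * (d - 1) // 2 != m:
        raise ValueError("%d is not a valid condensed-matrix length" % m)
    return d

print(dim_from_condensed_length(10))  # 5
```

The explicit check rejects lengths that cannot come from a square distance matrix instead of silently rounding.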
Here is another solution:
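import numpy as np
from scipy.spatial.distance import squareform

def f(c,n):
    # mark the n-th condensed entry, expand to the square form,
    # and read off the rows that contain the marked entry
    tt = np.zeros_like(c)
    tt[n] = 1
    return tuple(np.nonzero(squareform(tt))[0])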

Use numpy.triu_indices for efficiency.

Use the following:
def PdistIndices(n,I):
    '''idx = {} indices for pdist results'''
    idx = numpy.array(numpy.triu_indices(n,1)).T[I]
    return idx

Here I is an array of indices.
However, a better solution is to implement an optimized brute-force search, for example in Fortran:
function PdistIndices(n,indices,m) result(IJ)
    !IJ = {} indices for pdist[python] selected results[indices]
    implicit none
    integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
    logical:: finished
    k = 0; w = 0; finished = .false.
    do i=0,n-2
        do j=i+1,n-1
            if (k==indices(w)) then
                IJ(w,:) = [i,j]
                w = w+1
                if (w==m) then
                    finished = .true.
                    exit
                endif
            endif
            k = k+1
        enddo
        if (finished) then
            exit
        endif
    enddo
end function

Then compile it with F2PY and enjoy unbeatable performance.
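
For readers without a Fortran toolchain, the same scan can be sketched in plain Python. This is only meant to illustrate the logic and will be much slower than compiled code. Like the Fortran routine, it assumes that the requested indices are sorted in strictly increasing order; the name `pdist_indices_scan` is made up for this example.

```python
def pdist_indices_scan(n, indices):
    """Return the 0-based (i, j) pair for each condensed index in `indices`.

    Assumes `indices` is sorted in strictly increasing order.
    """
    result = []
    if not indices:
        return result
    w = 0   # position in `indices` of the next index to match
    k = 0   # running condensed index of the pair (i, j)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if k == indices[w]:
                result.append((i, j))
                w += 1
                if w == len(indices):
                    return result   # all requested indices were found
            k += 1
    return result

print(pdist_indices_scan(5, [0, 5, 9]))  # [(0, 1), (1, 3), (3, 4)]
```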
The formula for an index into the condensed matrix is
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + j - i - 1

where i is the row index, j is the column index, and d is the row length of the original (d X d) upper triangular matrix.
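
As a quick sanity check, the formula can be compared against a plain enumeration of the upper triangle. This sketch assumes 0-based i < j, as in the rest of this answer; `condensed_index` is an illustrative name.

```python
from itertools import combinations

def condensed_index(d, i, j):
    """Condensed index of entry (i, j), 0-based with i < j, in a d x d matrix."""
    return d * (d - 1) // 2 - (d - i) * (d - i - 1) // 2 + j - i - 1

d = 5
for k, (i, j) in enumerate(combinations(range(d), 2)):
    assert condensed_index(d, i, j) == k   # formula agrees with row-major order
```

Integer division is exact here because a product of two consecutive integers is always even.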
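Consider the case where the index refers to the leftmost non-zero entry of some row in the original matrix. For all such leftmost entries,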
j == i + 1

so
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2

With some algebra, we can rewrite this as
i ** 2 + (1 - (2 * d)) * i + 2 * index == 0

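We can then use the quadratic formula to find the roots of the equation. Both roots are non-negative, and we only care about the smaller one, (-b - sqrt(b ** 2 - 8 * index)) / 2 with b = 1 - 2 * d, because the larger one lies beyond the last row.
If the index does correspond to the leftmost non-zero cell, this root is a non-negative integer, which is the row number. Finding the column number is then just arithmetic: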
j = index - d * (d - 1) / 2 + (d - i) * (d - i - 1)/ 2 + i + 1

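If the index does not correspond to the leftmost non-zero cell, we will not find an integer root, but we can take the floor of the smaller root as the row number: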
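import math

def row_col_from_condensed_index(d,index):
    b = 1 - (2 * d)
    # floor of the smaller root of i**2 + b*i + 2*index == 0 is the row
    i = int((-b - math.sqrt(b ** 2 - 8 * index)) // 2)
    # the column follows from the index formula solved for j
    j = index + i * (b + i + 2) // 2 + 1
    return (i,j)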
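
For very large d, the floating-point square root could in principle round the wrong way near a row boundary. A hedged, integer-only variant of the same idea is sketched below. It assumes Python 3.8 or later for math.isqrt, and the name `row_col_exact` is hypothetical.

```python
import math

def row_col_exact(d, index):
    """Integer-only variant: 0-based (row, col) for a condensed index."""
    b = 1 - 2 * d
    disc = b * b - 8 * index     # discriminant of i**2 + b*i + 2*index == 0
    s = math.isqrt(disc)
    if s * s != disc:            # round the square root up when it is not exact
        s += 1
    i = (-b - s) // 2            # floor of the smaller root
    j = index + i * (b + i + 2) // 2 + 1
    return (i, j)

print(row_col_exact(5, 0), row_col_exact(5, 5), row_col_exact(5, 9))
# (0, 1) (1, 3) (3, 4)
```

Rounding the square root up before subtracting gives the same result as flooring the real-valued root, without leaving integer arithmetic.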

If you don't know d, you can compute it from the length of the condensed matrix:
((d - 1) * d) / 2 == len(condensed_matrix)
d = (1 + math.sqrt(1 + 8 * len(condensed_matrix))) // 2 

To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:
def vec_row_col(d,i):
    i = np.array(i)
    b = 1 - 2 * d
    x = np.floor((-b - np.sqrt(b**2 - 8*i))/2).astype(int)
    y = (i + x*(b + x + 2)/2 + 1).astype(int)
    if i.shape:
        return zip(x,y)
    else:
        return (x,y)

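I needed to do these computations for huge arrays, and the speedup compared to the non-vectorized version is (as usual) quite impressive (measured with IPython %timeit):
In this example it is about 37 times faster.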
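
True, in that case the function should take a reference to the condensed matrix. But it should be able to infer the dimension from the length of the condensed matrix.

Unfortunately, this goes into an infinite loop.

dim can be found by solving for n in n*(n-1) = len(condensed matrix) (or just keep a lookup table of possible/supported sizes).

f(5,1) gives (11,-4). I'm not sure I understand what is going on inside.

It is actually (n*(n-1))/2.

It doesn't scale, though. More than 10k 2-D observations will fill up memory.

@Ηλίας: Please elaborate. Assuming the condensed matrix dtype is double, triu_indices will consume the same amount of memory.

@eat: With from scipy.spatial.distance import pdist, pdist will happily handle up to 10k data points. Your function's size will grow to 10.000.000.

So I take back my comment! The problem is with pdist.

@Ηλίας: You could describe your goal in a separate question. Is it absolutely necessary to compute all pairwise distances? Thanks.

No doubt this solution is inefficient for any moderately sized 'n'.

Wouldn't 'n = ceil(sqrt(2*len(c)))' be sufficient?

@eat: Yes, absolutely sufficient. The above is too contrived.

It took me a long time to find this. Your answer deserves more attention.

Note: if you replace math with numpy, your solution is actually vectorized.

I think the problem may be that the title is not very clear. Do you have a suggestion for a better title?

There is a syntax error caused by an extra (. Also, why would i have a shape? The condensed distance matrix is always a 1d array.

Great answer, thanks. I just modified it to return zip(x,y) so that I get the output as a list.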

import numpy as np
from scipy.spatial import distance

test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)

%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop

res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop