Most efficient way to compute pairwise partial correlations in base R?

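The title says it all: what is the most efficient way to compute the pairwise partial correlations between every pair of columns of a matrix, controlling for all the other columns?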

Basically, something like the cor function below, but producing partial correlations rather than simple correlations:

#> cor(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
#Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
#Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
#Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

The result should match what we get from the ppcor library:

#> ppcor::pcor(iris[,-5])$estimate
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000
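For reference, a single entry of that matrix can be reproduced from the regression-residual definition of partial correlation: regress each of the two variables on the remaining ones and correlate the residuals. This is only a minimal sketch to show what is being computed (the names r1, r2 and pc are illustrative); it is far too slow to be a candidate answer.

```r
# Partial correlation of Sepal.Length and Sepal.Width given the two petal
# measurements, computed from the regression-residual definition.
r1 = resid(lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris))
r2 = resid(lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = iris))
pc = cor(r1, r2)
pc  # should agree with the Sepal.Length/Sepal.Width entry of the pcor output
```

Doing this for every pair means two regressions per pair, which is why a matrix-based route is preferable.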


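Solutions for other partial correlation coefficients (i.e. non-Pearson) are also welcome.

It is known that pairwise partial correlations controlling for all other variables can be obtained by inverting the correlation or covariance matrix, which takes O(n^3) time, where n is the number of variables. A possible solution is therefore: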

pcor.solve = function(x){
  res = solve(cov(x))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  return(res)
}
This is essentially a stripped-down version of ppcor::pcor. The result is:

pcor.solve(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000
Note, however, that the covariance matrix (or the correlation matrix, which gives the same result) must be positive definite.
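For anyone wondering where the scaling line in pcor.solve comes from, it is the standard identity linking partial correlations to the inverse covariance (precision) matrix. The symbols P and Σ are my own notation here, not something from ppcor:

```latex
\rho_{ij \cdot \text{rest}} \;=\; -\,\frac{P_{ij}}{\sqrt{P_{ii}\,P_{jj}}},
\qquad P = \Sigma^{-1}
```

The expression -res/sqrt(diag(res) %o% diag(res)) evaluates this for every pair at once. On the diagonal it gives -1, which is why the diagonal is overwritten with 1 afterwards. Any positive rescaling of Σ cancels in the ratio, which is also why the covariance and correlation matrices give the same answer.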


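Since this is mainly a matter of inverting a matrix efficiently, I looked into a thread on the topic: qr.solve and chol2inv can be applied to the covariance matrix to the same effect.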

pcor.qr = function(x){
  res = qr.solve(cov(x))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}
pcor.qr(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

pcor.chol = function(x){
  res = chol2inv(chol(cov(x)))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}
pcor.chol(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000
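On the non-Pearson part of the question: one possible approach, assuming a rank-based partial correlation is defined by applying the same inversion to a Spearman or Kendall correlation matrix, is to pass a method argument through to cor. The helper name pcor.method is hypothetical, and I have not benchmarked it. The rank-based matrix also has to be positive definite for chol to succeed.

```r
# Hypothetical helper: same Cholesky-based inversion as pcor.chol, but on a
# correlation matrix computed with the requested method.
pcor.method = function(x, method = c("pearson", "kendall", "spearman")){
  method = match.arg(method)
  res = chol2inv(chol(cor(x, method = method)))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res) = list(colnames(x), colnames(x))
  return(res)
}
pcor.method(iris[,-5], method = "spearman")
```

With method = "pearson" this reduces to the same result as the functions above, since the correlation and covariance matrices give identical partial correlations.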

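Update: singular value decomposition can also be used. For a symmetric positive definite square matrix A, the singular value decomposition is A = U D U^T, so its inverse is A^-1 = U D^-1 U^T.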

pcor.svd = function(x){
  res = svd(cov(x))
  res = res$v %*% diag(1/res$d) %*% t(res$v)
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}

pcor.svd(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000
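Since for a symmetric positive definite matrix this decomposition coincides with the eigendecomposition, a variant using eigen with symmetric = TRUE is another option. This is a sketch only (pcor.eigen is a name I made up), and I have not timed it against the others.

```r
pcor.eigen = function(x){
  e = eigen(cov(x), symmetric = TRUE)
  # V %*% diag(1/d) %*% t(V), written without building the diagonal matrix:
  # dividing t(V) by the eigenvalues scales row i by 1/d[i]
  res = e$vectors %*% (t(e$vectors) / e$values)
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res) = list(colnames(x), colnames(x))
  return(res)
}
pcor.eigen(iris[,-5])
```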

microbenchmark with 10000 repetitions:

library(microbenchmark)
#iris
dt1 = iris[,-5]
microbenchmark(
  ppcor = ppcor::pcor(dt1)$estimate,
  solve = pcor.solve(dt1),
  qr = pcor.qr(dt1),
  chol = pcor.chol(dt1),
  svd = pcor.svd(dt1),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean  median      uq        max neval cld
# ppcor 247.728 267.790 314.8356 280.853 296.248 196962.601 10000   c
# solve 176.816 198.743 217.1298 205.274 221.603   2425.964 10000  b 
#    qr 240.264 258.459 282.7005 270.123 285.518   4015.438 10000   c
#  chol 131.562 148.824 163.3567 154.423 167.019   1593.205 10000 a  
#   svd 179.615 199.675 219.2781 208.074 223.469   1920.710 10000  b 

#random data
dt2 = cbind(rnorm(1E4), rnorm(1E4)+2)
microbenchmark(
  ppcor = ppcor::pcor(dt2)$estimate,
  solve = pcor.solve(dt2),
  qr = pcor.qr(dt2),
  chol = pcor.chol(dt2),
  svd = pcor.svd(dt2),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean  median      uq       max neval  cld
# ppcor 243.063 267.323 306.4535 284.585 311.177  1833.936 10000    d
# solve 180.548 190.812 222.6685 198.277 216.004 84776.704 10000 a   
#    qr 229.068 248.662 282.8142 262.658 285.518  1954.301 10000   c 
#  chol 179.148 189.413 212.6551 198.277 216.005  1383.733 10000 a   
#   svd 213.672 230.933 262.5084 243.529 264.058  5261.543 10000  b  

#uncorrelated data
dt3 = cbind(sin(seq(0, 2*pi, length.out = 1000L)), cos(seq(0, 2*pi, length.out = 1000L)))
microbenchmark(
  ppcor = ppcor::pcor(dt3)$estimate,
  solve = pcor.solve(dt3),
  qr = pcor.qr(dt3),
  chol = pcor.chol(dt3),
  svd = pcor.svd(dt3),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean   median      uq      max neval  cld
# ppcor 142.759 162.354 188.7767 172.1500 191.745 2230.021 10000    d
# solve  80.711  89.108 102.8269  92.3740 101.704 1709.372 10000 a   
#    qr 130.629 145.092 168.0627 153.0220 169.351 4914.910 10000   c 
#  chol  79.777  87.709 102.2984  92.3740 101.238 6731.117 10000 a   
#   svd 112.901 127.363 147.1913 134.1285 148.358 1401.928 10000  b  
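
[Update] In other words, the ranking from fastest to slowest is now chol, solve, svd, qr, ppcor. Some speed can probably be gained by exploiting the fact that the covariance matrix is symmetric (the chol solution already does), and time might also be saved in the covariance computation itself.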
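As one way to experiment with the covariance step, the covariance matrix can be formed with crossprod on the centered data instead of calling cov. This is a sketch under the assumption that x is numeric with no missing values; pcor.chol.cp is a hypothetical name, and whether it is actually faster than cov is something I have not measured.

```r
pcor.chol.cp = function(x){
  x = as.matrix(x)
  xc = sweep(x, 2, colMeans(x))        # center each column
  # Same as cov(x) for complete data. The 1/(n-1) factor cancels in the
  # normalisation below, so it could be dropped as well.
  S = crossprod(xc) / (nrow(x) - 1)
  res = chol2inv(chol(S))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res) = list(colnames(x), colnames(x))
  return(res)
}
pcor.chol.cp(iris[,-5])
```

It would need to be added to the microbenchmark calls above to see where it lands.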



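Of course, the ppcor library is more general and can handle cases such as a non-invertible covariance matrix, so it is at a disadvantage in this comparison. Still, when partial correlations are to be computed exhaustively and the covariance matrix is known to be positive definite, the simpler solutions are what we want.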
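If some of that robustness is wanted without giving up the fast path, one possibility is to try the Cholesky route first and fall back to a pseudoinverse only when it fails. The sketch below assumes the MASS package is installed and uses its ginv function; pcor.safe is a hypothetical name and this is not a description of how ppcor does it internally. Note that chol may not throw an error for a matrix that is only nearly singular, and partial correlations from a pseudoinverse should be interpreted with care.

```r
pcor.safe = function(x){
  S = cov(x)
  # chol() throws an error when S is not numerically positive definite
  R = tryCatch(chol(S), error = function(e) NULL)
  if (is.null(R)) {
    warning("covariance matrix is not positive definite; using MASS::ginv")
    res = MASS::ginv(S)
  } else {
    res = chol2inv(R)
  }
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res) = list(colnames(x), colnames(x))
  return(res)
}
pcor.safe(iris[,-5])
```

For well-conditioned data the extra cost over pcor.chol is just the tryCatch wrapper.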

Thanks for the downvote!