Applying the dist function row-wise inside a custom function

This is a follow-up to a question I posted earlier. I think I have made significant progress, and the problem has now changed.

I have a `matching` matrix that looks like this:

    [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    7    8
[4,]    9   10
[5,]   11   13
[6,]   14   15
[7,]   16   17
[8,]   18   19
I also have `dtm`, a document-term matrix:

1108058_10-K_2005  . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006  . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005    . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006  1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005    . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006    . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006  . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005  . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006  . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005   . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006   . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006  . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005   . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006   . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005    . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006    . . . . . . . . . . . . . . . . . . . .
I can run the following `dist2` call:

dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")
which produces:

WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
                  1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005                 1               1                 1                 1
64040_10-K_2005                   1               1                 1                 1
1111247_10-K_2005                 1               1                 1                 1
1129425_10-K_2005                 1               1                 1                 1
943894_10-K_2005                  1               1                 1                 1
1176316_10-K_2005                 1               1                 1                 1
805305_10-K_2005                  1               1                 1                 1
63276_10-K_2005                   1               1                 1                 1
                  943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005                1                 1                1               1
64040_10-K_2005                  1                 1                1               1
1111247_10-K_2005                1                 1                1               1
1129425_10-K_2005                1                 1                1               1
943894_10-K_2005                 1                 1                1               1
1176316_10-K_2005                1                 1                1               1
805305_10-K_2005                 1                 1                1               1
63276_10-K_2005                  1                 1                1               1
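This is almost what I want, but not quite. It still performs far too many computations. I want to compute `dist2` based on the row-wise observations in `matching`. That is, compute `dist2` for observations 1 and 2, then compute `dist2` for observations 5 and 6, then for 7 and 8, and so on.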

Data:

Applying it to the second row:

  m1 <- as.matrix(dtm[matching[2, ], ])
  dist2(m1, method = "cosine", norm = "none")
Edit: When I apply the "full" function to the `matching` data, I get a matrix like the following:

dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none")

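(Note: I use a custom method `rwmd` instead of `cosine`, and I use all the data in the document-term matrix. I also drew a new random sample of the data, so this data does not match the earlier data.)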

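This gets me what I want, but with far too many computations. That is, I am only interested in the diagonal of this matrix, where the values are 0.06690147, 0.06690147, 0.02992449, and so on. These correspond to the points in the `matching` data shown here: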

     [,1] [,2]
[1,]    1    2
[2,]    3    5
[3,]    7    8
[4,]    9   10
[5,]   12   13
[6,]   15   16
[7,]   18   19
These points correspond to row positions in the `dtm` matrix:

> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]

1019695_10-K_2005  . . . . . . . . . .
1019695_10-K_2006  . . . . . . . . 1 1
718937_10-K_2005   . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006   . . . . . . . . . .
1034258_10-K_2006  . . . 1 . . . . . .
708955_10-K_2005   . . . . . . . . . .
708955_10-K_2006   . . . . . . . . . .
923120_10-K_2005   . . . . . . . . . .
923120_10-K_2006   . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005  . . . . . . . . . .
1020569_10-K_2006  1 . . . . . 1 . . .
1009463_10-K_2005  . . . . . 1 . . . .
862022_10-K_2005   . . . . . . . . . .
862022_10-K_2006   . . 1 . . . . . . .
868271_10-K_2005   . 1 . . . . . 1 . .
917857_10-K_2005   . . . . . . . . . .
917857_10-K_2006   . . . . 1 . . . . .
That is, I should obtain 7 results, which are the diagonal of the `dist2` matrix.

Edit 2: Applying all of the functions gives the following results:

    [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    7    8
[4,]    9   10
[5,]   11   13
[6,]   14   15
[7,]   16   17
[8,]   18   19
Method 1:

> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) : 
  inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)
Method 2:

> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
                           [,1]                       [,2]                       [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[3,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
                           [,4]                        [,5]                       [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[3,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
                           [,7]
[1,] -0.00000000000000001829033
[2,]  0.03438092421044294105803
[3,]  0.03438092421044294105803
[4,] -0.00000000000000001748001

(It gives some of the correct diagonal values, but also some extra results.)

This will loop through each row of the `matching` matrix and perform the operation you described:

apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   -2    1    1   -1    1    1    1    0
[2,]    1    1    1    1    1    1    1    1
[3,]    1    1    1    1    1    1    1    1
[4,]    1    1    0   -1   -1    0   -3    1
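A note on the shape of that result: each call returns a 2 x 2 distance matrix, and `apply()` flattens it column by column, so every pair ends up as one column of four numbers. Rows 2 and 3 then hold the two off-diagonal entries, which is the distance between the two documents of the pair. A minimal sketch of that behaviour, using base R's `dist()` and made-up toy data (`m` and `pairs` are illustrative names, not the asker's objects):

```r
# Toy data: 4 rows, paired as (1, 2) and (3, 4). Base R only, no text2vec.
m <- matrix(c(0, 0,
              3, 4,
              1, 1,
              1, 2), ncol = 2, byrow = TRUE)
pairs <- matrix(c(1, 2,
                  3, 4), ncol = 2, byrow = TRUE)

# Each call returns a 2 x 2 matrix; apply() flattens it column-wise into a
# length-4 vector, giving one column per pair (a 4 x 2 result here).
flat <- apply(pairs, 1, function(idx) as.matrix(dist(m[idx, ])))

# Rows 2 and 3 are the off-diagonal entries, i.e. the within-pair distance.
pair_dist <- flat[2, ]
print(pair_dist)  # 5 1
```

Keeping only row 2 (or row 3) of the flattened result therefore gives one value per row of the index matrix.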
Alternatively, if you want to keep the naming convention, you can skip the conversion with `as.matrix`:

res<-apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res

[[1]]
2 x 2 Matrix of class "dgeMatrix"
                  1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005                -2                 1
1108058_10-K_2006                 1                 1

[[2]]
2 x 2 Matrix of class "dgeMatrix"
                64040_10-K_2005 64040_10-K_2006
64040_10-K_2005               1               1
64040_10-K_2006               1               1

#6 more list items...
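Also, you tried to pass two variables, `x` and `y`, in the `apply` statement. `apply()` passes only one variable, the row vector. Instead, you have to subset: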

apply(matching, 1, function(x) sum(x[1],x[2]))

[1]  3 11 15 19 24 29 33 37
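The same subsetting idea can be applied once per column of the index matrix instead of once per row, which avoids the loop entirely for plain cosine distance. The sketch below is a hand-rolled cosine computation on toy data, not `dist2` itself; `toy_dtm` and `toy_matching` are hypothetical stand-ins, and only the Matrix package is assumed.

```r
library(Matrix)

# Toy stand-ins for dtm and matching (not the asker's data).
toy_dtm <- Matrix(c(1, 0, 2,
                    2, 0, 4,
                    1, 0, 0,
                    0, 1, 0), nrow = 4, byrow = TRUE, sparse = TRUE)
toy_matching <- matrix(c(1, 2,
                         3, 4), ncol = 2, byrow = TRUE)

# Subset once per column of the index matrix, keeping the matrix shape.
a <- toy_dtm[toy_matching[, 1], , drop = FALSE]
b <- toy_dtm[toy_matching[, 2], , drop = FALSE]

# Row-wise cosine distance: 1 - <a_i, b_i> / (|a_i| * |b_i|).
# All-zero rows would give NaN here because of the division by zero.
row_cosine_dist <- 1 - rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))

# For comparison: the full cross matrix needs nrow(a) * nrow(b) dot
# products, and only its diagonal matches the row-wise result.
a_n <- Diagonal(x = 1 / sqrt(rowSums(a^2))) %*% a
b_n <- Diagonal(x = 1 / sqrt(rowSums(b^2))) %*% b
full <- 1 - as.matrix(tcrossprod(a_n, b_n))

print(row_cosine_dist)
print(diag(full))
```

This only covers cosine distance; a custom method such as `rwmd` would still need a per-pair call.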

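Which package is `dist2` from?

The only package I loaded is `library(text2vec)`, which contains the `dist2` function.

See the edit to my answer. My untested answer works, but it may not give the intended result.

Thanks! I added two edits; the results are all there.

You could check whether `diag(as.matrix(dist2(dtm[…])))` works. In other words, you just need to extract the diagonal from whichever analysis you prefer. But I don't know how you expect me to help: you did not provide the `rwmd` function, and I cannot reproduce your data. Editing a question this many times is not ideal.

Thank you very much for your help! You have greatly reduced the computation time! I used `apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))` and filtered the result down to the second row, since those are the values I want. It still takes a while (it has been running for 4 hours), but at least it is no longer computing the cosine similarity for every observation in the matrix, only a much smaller portion!

Here is another idea: `for (i in seq_len(nrow(matching))) { print(dist2(t(dtm[matching[i, 1], ]), t(dtm[matching[i, 2], ]), method = "cosine", norm = "none")) }`. In theory it should produce the diagonal, although the format may be odd.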
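Following up on the diagonal and loop ideas in the comments: text2vec also exports `pdist2()`, which computes "parallel" distances between corresponding rows of two matrices, so it returns one value per pair rather than a full cross matrix. The sketch below uses toy data and a hypothetical helper `paired_cosine()`; it is untested against the asker's `dtm` and does not cover the custom `rwmd` method. As far as the text2vec documentation describes it, `norm = "none"` assumes the rows are already scaled, which may be why the cosine outputs above contain values such as -2, so the sketch uses `norm = "l2"`.

```r
library(Matrix)

# Toy stand-ins for dtm and matching (not the asker's data).
toy_dtm <- Matrix(c(1, 0, 2,
                    2, 0, 4,
                    1, 0, 0,
                    0, 1, 0), nrow = 4, byrow = TRUE, sparse = TRUE)
toy_matching <- matrix(c(1, 2,
                         3, 4), ncol = 2, byrow = TRUE)

# Hypothetical helper: one cosine distance per row of the index matrix.
paired_cosine <- function(dtm, matching) {
  text2vec::pdist2(dtm[matching[, 1], , drop = FALSE],
                   dtm[matching[, 2], , drop = FALSE],
                   method = "cosine", norm = "l2")
}

# Only run the call when text2vec is actually installed.
if (requireNamespace("text2vec", quietly = TRUE)) {
  print(paired_cosine(toy_dtm, toy_matching))
}
```

On real data the call would be `paired_cosine(dtm, matching)`, giving a vector with one entry per row of `matching`.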
library(abind)
abind::abind(lapply(res, as.matrix), along = 3)

, , 1

                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005              -2               1
63276_10-K_2006               1               1

, , 2

                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005               1               1
63276_10-K_2006               1               1

#6 more matrix slices...
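Once the per-pair matrices are stacked into a 3-D array like this, the within-pair distance sits at position `[1, 2]` of every slice. A minimal base R sketch, using `simplify2array()` in place of `abind` and made-up 2 x 2 matrices (`mats` is illustrative, not the `res` object above):

```r
# Two made-up 2 x 2 distance matrices standing in for the list of results.
mats <- list(matrix(c(0, 5, 5, 0), nrow = 2),
             matrix(c(0, 1, 1, 0), nrow = 2))

# Stack the matrices along a third dimension: the result is 2 x 2 x 2.
arr <- simplify2array(mats)

# The off-diagonal entry of every slice is the distance for that pair.
pair_dist <- arr[1, 2, ]
print(pair_dist)  # 5 1
```

Indexing the array this way returns a plain numeric vector with one entry per pair.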