Applying the dist function row-wise in a custom function
This is a follow-up to an earlier question of mine. I think I have made significant progress, and the problem has now changed. I have a `matching` matrix that looks like this:
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 7 8
[4,] 9 10
[5,] 11 13
[6,] 14 15
[7,] 16 17
[8,] 18 19
I also have a `dtm`, a document-term matrix:
1108058_10-K_2005 . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006 . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006 1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005 . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006 . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005 . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006 . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006 . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006 . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005 . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006 . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005 . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006 . . . . . . . . . . . . . . . . . . . .
I can run the following `dist2` call:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")
which produces:
WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
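This is almost what I want, but not quite. It still performs too many calculations. I want to compute `dist2` based on the row observations in `matching`: that is, compute `dist2` for observations 1 and 2, then for observations 5 and 6, then for 7 and 8, and so on.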
Applying it to the second row of `matching`:
m1 <- as.matrix(dtm[matching[2, ], ])
dist2(m1, method = "cosine", norm = "none")
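The single-row call above can be repeated for every row of `matching` so that only one distance per pair is computed. Here is a minimal sketch of that idea in base R. `toy_dtm`, `toy_matching`, and `pair_cosine_dist` are hypothetical stand-ins, not the real data or the `text2vec` API. The helper normalises by vector length itself, which may differ from what `dist2` does with `norm = "none"`; I have not verified that here.

```r
# Toy stand-in for the document-term matrix (hypothetical data, not the real dtm).
toy_dtm <- rbind(
  doc1 = c(1, 0, 1, 0),
  doc2 = c(1, 1, 0, 0),
  doc3 = c(0, 2, 0, 1),
  doc4 = c(0, 1, 0, 1)
)

# Each row of toy_matching names one pair of row positions to compare.
toy_matching <- rbind(c(1, 2), c(3, 4))

# Cosine distance between two numeric vectors (hypothetical helper).
# Returns NaN if either vector is all zeros.
pair_cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# One distance per row of toy_matching, and nothing else.
pair_dists <- vapply(
  seq_len(nrow(toy_matching)),
  function(i) {
    pair_cosine_dist(
      toy_dtm[toy_matching[i, 1], ],
      toy_dtm[toy_matching[i, 2], ]
    )
  },
  numeric(1)
)
print(pair_dists)
```

`vapply` with `numeric(1)` guarantees a plain numeric vector with one entry per matched pair, so nothing has to be filtered out afterwards.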
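Edit:
When I apply the "full" function to the matching data, I get a matrix from the following call:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none")
(Note: I am using the custom method `rwmd` rather than `cosine`, and I am using all of the data in the document-term matrix. I also drew a new random sample, so this data does not match the earlier data.)
This gives me what I want, but with too many calculations. That is, I am only interested in the diagonal of this matrix, where the values are 0.06690147, 0.06690147, 0.02992449, and so on. These correspond to the points in the matching data here: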
[,1] [,2]
[1,] 1 2
[2,] 3 5
[3,] 7 8
[4,] 9 10
[5,] 12 13
[6,] 15 16
[7,] 18 19
These points correspond to row positions in the `dtm` matrix:
> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
[[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]
1019695_10-K_2005 . . . . . . . . . .
1019695_10-K_2006 . . . . . . . . 1 1
718937_10-K_2005 . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006 . . . . . . . . . .
1034258_10-K_2006 . . . 1 . . . . . .
708955_10-K_2005 . . . . . . . . . .
708955_10-K_2006 . . . . . . . . . .
923120_10-K_2005 . . . . . . . . . .
923120_10-K_2006 . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005 . . . . . . . . . .
1020569_10-K_2006 1 . . . . . 1 . . .
1009463_10-K_2005 . . . . . 1 . . . .
862022_10-K_2005 . . . . . . . . . .
862022_10-K_2006 . . 1 . . . . . . .
868271_10-K_2005 . 1 . . . . . 1 . .
917857_10-K_2005 . . . . . . . . . .
917857_10-K_2006 . . . . 1 . . . . .
That is, I should obtain 7 results, which are the diagonal of the `dist2` matrix.
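To illustrate why only the diagonal is needed, here is a small base R sketch comparing the full cross-distance matrix with a row-wise computation. It uses cosine distance on hypothetical toy data (`toy_dtm`, `toy_matching`) rather than `rwmd`, which is not available here, so it shows the shape of the idea and not the actual method.

```r
# Hypothetical toy data: six documents, three matched pairs.
toy_dtm <- rbind(
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 1, 0, 0),
  d3 = c(0, 2, 0, 1),
  d4 = c(0, 1, 0, 1),
  d5 = c(1, 1, 1, 1),
  d6 = c(0, 0, 1, 1)
)
toy_matching <- rbind(c(1, 2), c(3, 4), c(5, 6))

left  <- toy_dtm[toy_matching[, 1], , drop = FALSE]
right <- toy_dtm[toy_matching[, 2], , drop = FALSE]

# Scale every row to unit Euclidean length (assumes no all-zero rows).
l2_normalise <- function(m) m / sqrt(rowSums(m^2))
left_n  <- l2_normalise(left)
right_n <- l2_normalise(right)

# Full cross matrix: n x n distances, most of which are not wanted.
full_dist <- 1 - left_n %*% t(right_n)

# Row-wise version: only the n matched pairs.
rowwise_dist <- 1 - rowSums(left_n * right_n)

# The diagonal of the full matrix matches the row-wise result.
print(all.equal(unname(diag(full_dist)), unname(rowwise_dist)))
```

The full matrix costs n x n comparisons while the row-wise version costs n, which is where the savings would come from on a large `dtm`.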
Edit 2:
Applying the suggested functions gives the following results:
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 7 8
[4,] 9 10
[5,] 11 13
[6,] 14 15
[7,] 16 17
[8,] 18 19
Method 1:
> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) :
inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)
Method 2:
> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
[,1] [,2] [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[3,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
[,4] [,5] [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[3,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
[,7]
[1,] -0.00000000000000001829033
[2,] 0.03438092421044294105803
[3,] 0.03438092421044294105803
[4,] -0.00000000000000001748001
(It gives the correct diagonal values, but also some extra results.)

This loops through each row of the matching matrix and runs the line you mentioned:
apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -2 1 1 -1 1 1 1 0
[2,] 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1
[4,] 1 1 0 -1 -1 0 -3 1
Alternatively, if you want to keep the naming convention, you can skip the conversion with `as.matrix`:
res<-apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res
[[1]]
2 x 2 Matrix of class "dgeMatrix"
1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005 -2 1
1108058_10-K_2006 1 1
[[2]]
2 x 2 Matrix of class "dgeMatrix"
64040_10-K_2005 64040_10-K_2006
64040_10-K_2005 1 1
64040_10-K_2006 1 1
#6 more list items...
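Also, you tried to pass two variables, `x` and `y`, in the apply statement. `apply()` passes only one variable, the row vector. Instead, you have to subset it: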
apply(matching, 1, function(x) sum(x[1],x[2]))
[1] 3 11 15 19 24 29 33 37
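The "extra results" in the 4-row output can be understood from how `apply()` simplifies its return values: when the function returns a 2 x 2 matrix, each result is flattened column by column into a length-4 column, so rows 2 and 3 hold the two off-diagonal entries. Here is a small sketch of that behaviour, where `toy_pair_matrix` is a hypothetical stand-in for `dist2` and simply returns a symmetric 2 x 2 matrix:

```r
toy_matching <- rbind(c(1, 2), c(3, 4), c(5, 6))

# Hypothetical stand-in for dist2 on a pair of rows:
# a symmetric 2 x 2 matrix with zeros on the diagonal.
toy_pair_matrix <- function(idx) {
  d <- (idx[1] * idx[2]) / 100
  matrix(c(0, d, d, 0), nrow = 2)
}

# apply() flattens each 2 x 2 result into one column of length 4,
# giving a 4 x nrow(toy_matching) matrix.
flattened <- apply(toy_matching, 1, toy_pair_matrix)
print(dim(flattened))

# Extracting element [1, 2] inside the function returns one number per pair.
offdiag <- apply(toy_matching, 1, function(idx) toy_pair_matrix(idx)[1, 2])
print(offdiag)
```

With a real `dist2` result, the same `[1, 2]` extraction would presumably work, although I have not tested it on a `dgeMatrix`.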
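Which package is `dist2` from?

The only package I have loaded is `library(text2vec)`, which contains the `dist2` function.

See the edit to my answer. My untested answer works, but it may not give the intended result.

Thanks! I added 2 edits, and the results are all there.

You could check whether `diag(as.matrix(dist2(dtm[…])))` works. In other words, you would only need to extract the diagonal from your preferred analysis. But I am not sure how you want me to help: you have not provided the function `rwmd`, and I cannot reproduce your data. Editing a question this heavily is not ideal.

Thank you very much for your help! You have greatly reduced the computation time! I used `apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))` and filtered the result down to the second row, since those are the values I want. It still takes a while (it has been running for 4 hours), but at least it is no longer computing the cosine similarity for every observation in the matrix, only a much smaller subset!

Here is another idea: `for (i in seq_len(nrow(matching))) { print(dist2(t(dtm[matching[i, 1], ]), t(dtm[matching[i, 2], ]), method = "cosine", norm = "none")) }`. In theory, it should produce the diagonal, although the format may be odd.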
The list of matrices can also be stacked into a 3-D array with `abind`:
library(abind)
abind::abind(lapply(res, as.matrix), along = 3)
, , 1
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 -2 1
63276_10-K_2006 1 1
, , 2
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 1 1
63276_10-K_2006 1 1
#6 more matrix slices...