Comparing a list of strings against each other in R

r nlp

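I am trying to use the levenshteinSim function from the RecordLinkage package to compare a list of strings against each other. However, I am struggling to work out how to feed a list of strings into the function, since it only takes two arguments, str1 and str2. I am looking for the most efficient approach, because my list contains about 4k strings. Any help is much appreciated.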

Here is some sample data:

sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')

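I think this may be what you are after. The RecordLinkage package was no longer on CRAN when this was written, so I used a different package, stringdist, to compute the Levenshtein distance: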

library(stringdist)

sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')

df <- expand.grid(sample, sample) # this creates a dataframe of all combinations of the sample elements

stringdist(df$Var1, df$Var2, method = "lv")

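A dplyr version, which is perhaps a bit nicer to read and stores the distance in a levenshtein column next to each pair, outputs: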

     Var1  Var2 levenshtein
1   apple apple           0
2  appeal apple           3
3 apparel apple           3
4    peel apple           4
5    peer apple           4
6    pear apple           4
...
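
With about 4k strings, expand.grid yields roughly 16 million rows, and every unordered pair shows up twice in addition to the self-pairs. The following is a minimal sketch of keeping each pair only once. It assumes base R's adist is an acceptable stand-in for stringdist (my substitution, so that no extra package is needed), and the name pairs is just illustrative.

```r
sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')

# Full symmetric matrix of Levenshtein distances (utils::adist, base R).
d <- adist(sample)

# Row/column indices of the strict upper triangle:
# each unordered pair appears exactly once, self-pairs are skipped.
idx <- which(upper.tri(d), arr.ind = TRUE)

pairs <- data.frame(
  Var1 = sample[idx[, "row"]],
  Var2 = sample[idx[, "col"]],
  levenshtein = d[idx],          # matrix indexing by (row, col) pairs
  stringsAsFactors = FALSE
)

nrow(pairs)  # choose(6, 2) unordered pairs
```

For n strings this leaves n * (n - 1) / 2 rows instead of n^2, although the full matrix is still computed first.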

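Here is a base R solution for a distance matrix. Note that it compares the code points of the two strings position by position and counts the mismatches, so it does not account for insertions or deletions and is not the Levenshtein distance when the strings differ in length or alignment: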

z <- Map(utf8ToInt,sample)
dmat <- outer(z,z,FUN = Vectorize(function(x,y) sum(bitwXor(x,y)>0)))
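
To make the difference between a position-wise comparison and an edit distance concrete, here is a small sketch. The helper mismatch_count is hypothetical (not part of any package), and padding the shorter string with NA is my own assumption about how to handle unequal lengths.

```r
# Hypothetical helper: count the positions at which two strings differ,
# padding the shorter one with NA so both have the same length.
mismatch_count <- function(a, b) {
  x <- utf8ToInt(a)
  y <- utf8ToInt(b)
  n <- max(length(x), length(y))
  x <- c(x, rep(NA_integer_, n - length(x)))
  y <- c(y, rep(NA_integer_, n - length(y)))
  # A padded position always counts as a mismatch.
  sum(is.na(x) | is.na(y) | x != y)
}

mismatch_count("pear", "spear")  # every position differs once shifted
adist("pear", "spear")           # Levenshtein: a single insertion

sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')
hmat <- outer(sample, sample, Vectorize(mismatch_count))
```

The two measures agree for same-length strings that differ only by substitutions, such as peel and peer, but diverge as soon as one string is a shifted version of the other.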

Getting the Levenshtein distance or Levenshtein similarity is straightforward even without the RecordLinkage package, which was recently removed from CRAN.

In base R:

sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')
adist(sample)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    0    3    3    4    4    4
#> [2,]    3    0    3    3    4    3
#> [3,]    3    3    0    4    5    4
#> [4,]    4    3    4    0    1    2
#> [5,]    4    4    5    1    0    1
#> [6,]    4    3    4    2    1    0

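If you want similarities rather than distances, you can use stringsim or stringsimmatrix from stringdist to get all comparisons at once, which at the time of this answer was only available in the development version: devtools::install_github("markvanderloo/stringdist/pkg").

If you want the result in a tidy format, you can do the following (the value column is named distance in the code, but it holds similarities, where 1 means identical):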

library(tidyverse)
stringdist::stringsimmatrix(sample, method = "lv", useNames = "strings") %>% 
  as.matrix() %>%
  as_tibble(rownames = "word1") %>% 
  pivot_longer(-word1, names_to = "word2", values_to = "distance")
#> # A tibble: 36 x 3
#>    word1  word2   distance
#>    <chr>  <chr>      <dbl>
#>  1 apple  apple      1    
#>  2 apple  appeal     0.4  
#>  3 apple  apparel    0.4  
#>  4 apple  peel       0.200
#>  5 apple  peer       0.200
#>  6 apple  pear       0.200
#>  7 appeal apple      0.5  
#>  8 appeal appeal     1    
#>  9 appeal apparel    0.5  
#> 10 appeal peel       0.5  
#> # ... with 26 more rows
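
A similarity matrix can also be derived from adist alone. The sketch below assumes the common normalisation 1 - d / max(nchar(a), nchar(b)), which, as far as I know, is what levenshteinSim computes; the variable names are illustrative. This normalisation is symmetric, whereas the output listed above gives 0.4 for apple/appeal but 0.5 for appeal/apple, so check the values against the stringdist version you have installed.

```r
sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')

d <- adist(sample)                 # Levenshtein distances
len <- nchar(sample)
max_len <- outer(len, len, pmax)   # longer length of each pair

# Similarity in [0, 1]: 1 for identical strings,
# 0 when every character of the longer string must be edited.
sim <- 1 - d / max_len
dimnames(sim) <- list(sample, sample)

round(sim, 3)
```

With 4k strings this is a 4000 x 4000 numeric matrix (16 million doubles, about 128 MB), which is usually manageable in memory.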

Comments:

Would you like to mark one of the answers as accepted?

Nice to see adist, love it!