R: Counting the number and names of similar-sounding words in a data frame


I have a data frame like this:

    Word<-c("bat", "cat", "cab", "some", "ban", "bait", "at", "done", "dot", "ran", "cant")
    S1<-c("b","c","c","s", "b", "b", "a", "d","d", "r", "c")
    S2<-c("a","a","a","o","a","a","t","o","o","a","a")
    S3<-c("t","t","b","m", "n", "i", "", "n","t", "n", "n")
    S4<-c("","","","e", "", "t", "", "e","", "", "t")
    df<-data.frame(Word, S1, S2, S3, S4,  stringsAsFactors=FALSE)

For each word, I want to count the similar-sounding words and list their names, so that the result looks like this:

    Word<-c("bat", "cat", "cab", "some", "ban", "bait", "at", "done", "dot", "ran", "cant")
    S1<-c("b","c","c","s", "b", "b", "a", "d","d", "r", "c")
    S2<-c("a","a","a","o","a","a","t","o","o","a","a")
    S3<-c("t","t","b","m", "n", "i", "", "n","t", "n", "n")
    S4<-c("","","","e", "", "t", "", "e","", "", "t")
    Number<-c(4,4,1,0,2,1,2,0,0,1,2)
    Names<-c("cat, ban, bait, at", "bat, cab, at, cant","cat","","bat, ran","bat","bat, cat","","","ban","can, cat")
    df2<-data.frame(Word, S1, S2, S3, S4, Number, Names,  stringsAsFactors=FALSE)

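If I understand correctly, you seem to be looking for the Levenshtein distance between your words. The adist function in the utils package can compute it for you. It returns a matrix whose entry (i, j) is the number of substitutions/insertions/deletions needed to turn the i-th word into the j-th word: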

    dist <- utils::adist(Word)
    dist

You can then loop over the rows (or columns) and return every word at distance 1:

    links <- apply(dist, 1, function(d) {
      paste0(Word[d == 1], collapse = ", ")
    })
    cbind.data.frame(Word, links)

You have now derived the first and last columns of df2 programmatically. For the counts, you can simply use:

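    counts <- apply(dist, 1, function(d){sum(d == 1)})
    counts

Yes, I am trying to compute the Levenshtein distance, but not for English. I have another language in which the words look like "ban1", "an2", "dang4", "sian3", and I want to compute the Levenshtein distance for them. When I apply this code, it shows the error "dim(X) must have a positive length".

At which step of the code does it throw this error? Do you have a short character vector that triggers it, so that I can check?

Also, what if I want to count and name the similar words under a different scheme? For example, for the word "ban1", by adding, deleting, or substituting a unit that differs from "b", "an", or "1". That is why I have columns like S1, S2, S3, S4: they can help define the distance used to count similar-sounding words.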
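
For the follow-up about treating "b", "an", and "1" as single units, one possible approach is to run the usual Levenshtein recurrence over segment tokens instead of characters. The sketch below is only an illustration under stated assumptions: `token_dist` and `token_dist_matrix` are hypothetical helper names, the onset/rime/tone split in S1 to S3 is an assumed segmentation of the words quoted above, and "ban2" is a made-up word added so that at least one pair is a single edit apart.

```r
# Levenshtein distance between two token vectors; every token counts as one unit.
token_dist <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1] <- 0:n                      # delete all tokens of a
  d[1, ] <- 0:m                      # insert all tokens of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1L, j + 1L] <- min(d[i, j + 1L] + 1L,  # deletion
                               d[i + 1L, j] + 1L,  # insertion
                               d[i, j] + cost)     # substitution or match
    }
  }
  d[n + 1L, m + 1L]
}

# Pairwise distance matrix for a list of token vectors.
token_dist_matrix <- function(tokens) {
  n <- length(tokens)
  out <- matrix(0L, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- token_dist(tokens[[i]], tokens[[j]])
    }
  }
  out
}

# Assumed segmentation (onset, rime, tone); "ban2" is hypothetical.
Word <- c("ban1", "an2", "dang4", "sian3", "ban2")
S1   <- c("b",    "",    "d",     "s",     "b")
S2   <- c("an",   "an",  "ang",   "ian",   "an")
S3   <- c("1",    "2",   "4",     "3",     "2")
seg  <- data.frame(Word, S1, S2, S3, stringsAsFactors = FALSE)

# The non-empty segments of each row are that word's tokens.
tokens <- lapply(seq_len(nrow(seg)), function(i) {
  x <- unlist(seg[i, c("S1", "S2", "S3")], use.names = FALSE)
  x[x != ""]
})

tdist <- token_dist_matrix(tokens)
seg$Number <- apply(tdist, 1, function(d) sum(d == 1))
seg$Names  <- apply(tdist, 1, function(d) paste(seg$Word[d == 1], collapse = ", "))
seg
```

When every segment is a single letter, as in the English example in the question, this token-level distance should coincide with what utils::adist(Word) returns, so the same `d == 1` filtering applies.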
Output of cbind.data.frame(Word, links):

       Word              links
    1   bat cat, ban, bait, at
    2   cat bat, cab, at, cant
    3   cab                cat
    4  some
    5   ban           bat, ran
    6  bait                bat
    7    at           bat, cat
    8  done
    9   dot
    10  ran                ban
    11 cant                cat
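
Regarding the "dim(X) must have a positive length" error mentioned in the comments: apply() raises it whenever its first argument has no dim attribute. The comment does not show the failing call, so the cause is a guess. One possibility is that the `dist <- utils::adist(Word)` line was not run, in which case `dist` still refers to the function stats::dist. The sketch below uses a different variable name (`word_dist`, chosen here for illustration) and checks that the result is a matrix before calling apply().

```r
# Words quoted in the comment.
Word <- c("ban1", "an2", "dang4", "sian3")

# A distinct name avoids confusion with stats::dist, which is a function
# and therefore has no dimensions.
word_dist <- utils::adist(Word)
stopifnot(is.matrix(word_dist))

# Calling apply() on an object without dimensions produces the reported error.
msg <- tryCatch(apply(stats::dist, 1, sum),
                error = function(e) conditionMessage(e))
msg

# Same counting and naming steps as in the answer, applied to the checked matrix.
res <- data.frame(
  Word,
  Number = apply(word_dist, 1, function(d) sum(d == 1)),
  Names  = apply(word_dist, 1, function(d) paste(Word[d == 1], collapse = ", ")),
  stringsAsFactors = FALSE
)
res
```

If `is.matrix(word_dist)` holds and the error still appears, checking `str()` of the object actually passed to apply() is a reasonable next step.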