R: find and impute the centroid of words (strings)


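Suppose I have the following data frame of car brands. How can I find the centroid of each brand (word) and assign that centroid to the most "similar" words? The goal is to obtain a second column, pal_ok, with the normalized label.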

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"))

        pal1
1       fiat
2       fiat
3       fiat
4     fiat 1
5      fiatt
6       fait
7      fiaat
8    renault
9    renault
10   renault
11  renaultt
12 renault 3
13  renaultc
14   remault

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"),
               pal_ok  =c("fiat","fiat","fiat","fiat","fiat","fiat","fiat","renault","renault","renault","renault","renault","renault","renault"))

        pal1  pal_ok
1       fiat    fiat
2       fiat    fiat
3       fiat    fiat
4     fiat 1    fiat
5      fiatt    fiat
6       fait    fiat
7      fiaat    fiat
8    renault renault
9    renault renault
10   renault renault
11  renaultt renault
12 renault 3 renault
13  renaultc renault
14   remault renault

You can try this with the base function adist and a dplyr chain:

library(dplyr)

# here you calculate your "centroids", i.e. the most common words, if that is what you mean
pal <- as.data.frame.table(table(db$pal1)) %>%                    # frequency table
       arrange(Freq) %>%                                          # sort by frequency
       top_n(2, Freq)                                             # keep the top 2; inspect your
                                                                  # data to decide how many to keep

 pal
     Var1 Freq
1    fiat    3
2 renault    3 
Now we can compute the distance between each "centroid" and each word:

# here the distance 
dist <- data.frame(adist(db$pal1,pal$Var1))

# rename the columns, in this case with only two brands
colnames(dist) <- c('fiat','renault')

 dist
   fiat renault
1     0       5
2     0       5
3     0       5
4     2       6
5     1       5
6     2       5
7     1       5
8     5       0
9     5       0
10    5       0
11    6       1
12    7       2
13    6       1
14    5       1
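
To make the numbers in this table easier to read, here is a small sketch that calls adist on single pairs of strings. adist computes a generalized Levenshtein distance (insertions, deletions and substitutions), so two swapped adjacent letters cost 2 rather than 1. The variable names are only for illustration and are not part of the original answer.

```r
# Edit distances between single pairs of strings (base R, utils::adist).
# adist returns a matrix, so [1, 1] extracts the single value.
d_same  <- adist("fiat", "fiat")[1, 1]     # identical strings
d_extra <- adist("fiatt", "fiat")[1, 1]    # one extra letter: one deletion
d_swap  <- adist("fait", "fiat")[1, 1]     # swapped letters: two substitutions
d_num   <- adist("fiat 1", "fiat")[1, 1]   # trailing " 1": two deletions

print(c(same = d_same, extra = d_extra, swap = d_swap, num = d_num))
```

These values correspond to rows 1, 5, 6 and 4 of the fiat column above.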
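
How do you define the "centroid"? As the most common word (in this case fiat and renault)? Have a look at the stringdist package; you may also need stemming. This may help.

Finally, assign each word to its nearest "centroid":
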
cbind(db,dist) %>%                                               # bind data and distances
mutate(pal_calc = ifelse(fiat<renault,'fiat','renault')) %>%     # choose the closer centroid
select(-c(fiat,renault))                                         # remove the helper columns

        pal1  pal_ok pal_calc
1       fiat    fiat     fiat
2       fiat    fiat     fiat
3       fiat    fiat     fiat
4     fiat 1    fiat     fiat
5      fiatt    fiat     fiat
6       fait    fiat     fiat
7      fiaat    fiat     fiat
8    renault renault  renault
9    renault renault  renault
10   renault renault  renault
11  renaultt renault  renault
12 renault 3 renault  renault
13  renaultc renault  renault
14   remault renault  renault
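
The ifelse step above is hard-coded for two brands. As a rough sketch of how the same idea might extend to any number of brands, the version below uses only base R: it takes the k most frequent strings as "centroids" and picks, for each word, the column of the adist matrix with the smallest distance. The names words, k, centroids and pal_calc are assumptions made for this illustration, and k still has to be chosen by inspecting the data.

```r
# Assumed input: the raw brand strings from the question
words <- c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault")

k <- 2  # assumed number of brands; choose it by looking at the frequency table

# "Centroids": the k most frequent strings
freq      <- sort(table(words), decreasing = TRUE)
centroids <- names(freq)[seq_len(k)]

# Distance matrix: one row per word, one column per centroid
d <- adist(words, centroids)

# For each row, take the index of the column with the smallest distance
nearest  <- apply(d, 1, which.min)
pal_calc <- centroids[nearest]

result <- data.frame(pal1 = words, pal_calc = pal_calc)
print(result)
```

This only works when the correct spelling of each brand is more frequent than any of its misspellings; if a misspelling ties with or outnumbers a real brand, it can end up among the centroids.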