Closest match of sentences between two data frames in R

I have two data frames. The first is saved in an object named b:

structure(list(CONTENT = c("@myntra beautiful teamä»ç where is the winners list?", 
"The best ever Puma wishlist for Workout freaks, Head over to @myntra  https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good", 
"I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym.  https://t.co/VeRy4G3c7X https://t.co/fOpBRWCdSh", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym.  https://t.co/VeRy4G3c7X.....", 
"@DrDrupad @myntra #myPUMAcollection superb :)", "Super exclusive collection @myntra #myPUMAcollection   https://t.co/Qm9dZzJdms", 
"@myntra gave my best Love playing wid u Hope to win  #myPUMAcollection", 
"Check out PUMA Unisex Black Running Performance Gloves on Myntra!  https://t.co/YD6IcvuG98 @myntra  #myPUMAcollection", 
"@myntra i have been mailing my issue daily since past week.All i get in reply is an auto generated assurance mail. 1st time pissed wd myntra"
), score = c(7.129, 7.08, 6.676, 5.572, 5.572, 5.535, 5.424, 
5.205, 4.464, 4.245)), .Names = c("CONTENT", "score"), row.names = c(25L, 
103L, 95L, 66L, 90L, 75L, 107L, 32L, 184L, 2L), class = "data.frame")

The second data frame is saved in an object named c:

structure(list(CONTENT = c("The best ever for workout  over to myntra like if you find it good", 
"i finalised buy a top  myntra and found the at in feel like i so in life"
)), .Names = "CONTENT", row.names = c(103L, 95L), class = "data.frame")
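
For each sentence in the second data frame (c), I want to find the closest match in the first data frame (b) and return the score from the first data frame (b).

For example, the sentence `The best ever for workout  over to myntra like if you find it good` closely matches the second sentence in data frame 1, so the score `7.080` should be returned.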

I tried using code from Stack Overflow with a few adjustments:

library(stringr)     # provides str_split()
library(data.table)
cp <- str_split(c$CONTENT, " ")
nn <- lengths(cp)  ## Or, for < R-3.2.0, `nn <- sapply(cp, length)`
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(cp), key="grp")
dt[,Score:=b$score[pmatch(X,b$CONTENT)]]
dt[!is.na(Score), list(avgScore=sum(Score)), by="grp"]
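
One likely reason this attempt returns few or no scores: `pmatch` only reports a hit when a word is a prefix of a whole `CONTENT` string, so words that occur in the middle of a sentence come back as `NA`. A minimal base R sketch of that behaviour, using made-up strings rather than the real data:

```r
# Made-up strings standing in for b$CONTENT (an assumption for illustration)
content <- c("The best ever Puma wishlist", "I finalised on buy a top")

# "The" is a prefix of exactly one string, so it matches element 1;
# "best" and "top" only occur mid-string, so pmatch() returns NA for them
hits <- pmatch(c("The", "best", "top"), content)
print(hits)  # 1 NA NA
```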

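Here is one approach using `stringsim` from the `stringdist` package. There are several methods (algorithms) to choose from. I chose the Jaro-Winkler measure (`"jw"`) for computing similarity because it seemed to produce reasonable results for this data. That said, my experience in this area is casual at best, so you may want to spend some time reading about and experimenting with the various algorithms that `stringdist` provides.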
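
To get a feel for how the choice of method changes the result, the same pair of strings can be scored under several methods. This is only a sketch: the strings are shortened, hypothetical stand-ins for the question's data, and the four methods shown are just a sample of what `stringsim` accepts.

```r
library(stringdist)

# Hypothetical strings, loosely modelled on the question's data
target     <- "The best ever for workout over to myntra like if you find it good"
candidates <- c(
    "The best ever Puma wishlist for Workout freaks, Head over to @myntra",
    "@myntra i have been mailing my issue daily since past week"
)

# Similarity of each candidate to the target under a few methods.
# Every value lies in [0, 1], and 1 means the strings are identical.
methods <- c("jw", "lv", "cosine", "jaccard")
sims <- sapply(methods, function(m) stringsim(candidates, target, method = m))
rownames(sims) <- c("candidate_1", "candidate_2")
print(round(sims, 3))
```

Comparing the rows shows which methods separate a good candidate from a poor one most clearly for a given kind of text.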


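To reduce clutter, I use this wrapper function, which returns the index of the most similar element (the one with the highest similarity value) for a given string: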

library(stringdist)
library(data.table)

best_match <- function(x, y, method = "jw", ...) {
    which.max(stringsim(x, y, method, ...))
}
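
The construction of the `Dt` table used in the following steps is not shown on this page. A minimal sketch of one way it might be set up, assuming a `MatchPhrase` column holding the phrases from `df_c` and a dummy `Idx` row id (the column names are taken from the later code; the shortened data and the construction itself are assumptions):

```r
library(stringdist)
library(data.table)

# Same wrapper as above, repeated so the sketch runs on its own
best_match <- function(x, y, method = "jw", ...) {
    which.max(stringsim(x, y, method, ...))
}

# Hypothetical, shortened stand-ins for df_b and df_c (not the real data)
df_b <- data.frame(
    CONTENT = c("@myntra beautiful team where is the winners list?",
                "The best ever Puma wishlist for Workout freaks, Head over to @myntra",
                "I finalised on buy a top from Myntra, and then I found the same top"),
    score = c(7.129, 7.08, 6.676),
    stringsAsFactors = FALSE
)
df_c <- data.frame(
    CONTENT = c("The best ever for workout over to myntra",
                "i finalised buy a top myntra and found the"),
    stringsAsFactors = FALSE
)

# Assumed layout: one row per phrase to match, plus a dummy row id so
# that best_match() can be called once per row through `by`
Dt <- data.table(MatchPhrase = df_c$CONTENT)
Dt[, Idx := .I]
print(Dt)

# A single direct call: index into df_b of the most similar sentence
idx <- best_match(df_b$CONTENT, df_c$CONTENT[1])
print(df_b$score[idx])
```

Grouping by a per-row id is what lets a function that handles one phrase at a time be applied to every row of the table.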

Using `best_match`, add a column holding the index of the best match (then drop the dummy `Idx` column):

Dt[, MatchIdx := best_match(df_b$CONTENT, MatchPhrase), 
    by = "Idx"][,Idx := NULL]

Then extract the corresponding elements from `df_b` (I renamed the data from `b` and `c` to `df_b` and `df_c`, respectively):

Dt[, .(Score = df_b$score[MatchIdx],
       BestMatch = df_b$CONTENT[MatchIdx]),
   by = "MatchPhrase"]
#                                                                MatchPhrase Score
#1:       The best ever for workout  over to myntra like if you find it good 7.080
#2: i finalised buy a top  myntra and found the at in feel like i so in life 6.676

#                                                                                                                                      BestMatch
#1: The best ever Puma wishlist for Workout freaks, Head over to @myntra  https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good
#2:         I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!

Are you committed to the `str_split` / `pmatch` approach for determining the best match for a given phrase? There are proper fuzzy matching algorithms that would produce better results in this situation.

@nrussell Not really... it would be helpful if you could let me know what kinds of fuzzy matching algorithms can be deployed.

Thanks a lot nrussell... it worked perfectly on the sample set. I will explore further how to implement this on my actual dataset. Thanks again.

@nrussell... I did go through Jaro distance... found it very interesting... thanks for introducing me to fuzzy matching algorithms... never knew about them before... it will be very helpful to me.