Fuzzy string matching in an R data frame
I have a data frame containing article titles and their associated URL links. My problem is that a URL link is not necessarily in the same row as the title it belongs to, for example:
title | urls
Who will be the next president? | https://website/5-ways-to-make-a-cocktail.com
5 ways to make a cocktail | https://website/who-will-be-the-next-president.com
2 millions raised by this startup | https://website/how-did-you-find-your-house.com
How did you find your house | https://website/2-millions-raised-by-this-startup.com
How did you find your house | https://washingtonpost/article/latest-movies-in-theater.com
Latest movies in Theater | www.newspaper/mynews/what-to-cook-in-summer.com
What to cook in summer | https://website/2-millions-raised-by-this-startup.com
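For anyone who wants to reproduce this, the table above could be built as follows. This is only a sketch: the name `df` is an assumption (it matches the name used in the answer's code further down), and `stringsAsFactors = FALSE` is set so that both columns stay character vectors.

```r
# Reproducible version of the example table.
# `df` is an assumed name, chosen to match the answer's code.
df <- data.frame(
  title = c(
    "Who will be the next president?",
    "5 ways to make a cocktail",
    "2 millions raised by this startup",
    "How did you find your house",
    "How did you find your house",
    "Latest movies in Theater",
    "What to cook in summer"
  ),
  urls = c(
    "https://website/5-ways-to-make-a-cocktail.com",
    "https://website/who-will-be-the-next-president.com",
    "https://website/how-did-you-find-your-house.com",
    "https://website/2-millions-raised-by-this-startup.com",
    "https://washingtonpost/article/latest-movies-in-theater.com",
    "www.newspaper/mynews/what-to-cook-in-summer.com",
    "https://website/2-millions-raised-by-this-startup.com"
  ),
  stringsAsFactors = FALSE  # keep both columns as character, not factor
)
```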
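My guess is that I need some kind of fuzzy matching logic, but I am not sure how to go about it. For the duplicates, I will simply use the unique function.

I started with the levenshteinSim function from the RecordLinkage package, which gives a similarity score for each row, but obviously, since the rows do not line up, the similarity scores are low everywhere.

I have also heard of the stringdistmatrix function from the stringdist package, but I am not sure how to use it here.

This can certainly be optimized, but it may get you started: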
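matcher() compares a title with a URL and returns a score. pairer() converts each title to lower-case words, tries to match it against every URL with matcher(), and keeps the URL with the highest score (or NA if that score is below the threshold).
In R: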
matcher <- function(needle, haystack) {
  ### Analyzes the url part, converts it to lower case words
  ### and calculates a score to return

  # convert url: keep the last path segment, split it on "-" and "."
  y <- unlist(strsplit(haystack, '/'))
  y <- tolower(unlist(strsplit(y[length(y)], '[-.]')))

  # convert needle (already a vector of lower case words)
  x <- needle

  # sum it up: mean of the two overlap ratios
  (z <- (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2)
}

pairer <- function(titles, urls, threshold = 0.75) {
  ### Calculates scores for each title -> url combination
  result <- vector(length = length(titles))
  for (i in seq_along(titles)) {
    needle <- tolower(unlist(strsplit(titles[i], ' ')))
    scores <- unlist(lapply(urls, function(url) matcher(needle, url)))
    high_score <- max(scores)

    # above threshold ?
    result[i] <- ifelse(high_score >= threshold,
                        urls[which(scores == high_score)], NA)
  }
  return(result)
}

df$guess <- pairer(df$title, df$urls)
df
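To see how the score comes about, here is the same calculation written out step by step for a single title/URL pair from the example data. It is a sketch that repeats the logic of matcher() inline rather than calling it; the variable names `title`, `url`, `parts` and `score` are only for illustration.

```r
# One title/URL pair taken from the example data
title <- "What to cook in summer"
url   <- "www.newspaper/mynews/what-to-cook-in-summer.com"

# Title -> lower-case words (this is what pairer() passes in as `needle`)
x <- tolower(unlist(strsplit(title, " ")))

# URL -> last path segment, split on "-" and "." (this is what matcher() does)
parts <- unlist(strsplit(url, "/"))
y <- tolower(unlist(strsplit(parts[length(parts)], "[-.]")))

# Mean of "share of title words found in the URL" and
# "share of URL words found in the title"
score <- (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2

print(x)
print(y)
print(score)
```

All five title words appear in the URL, but the URL also contributes the extra word "com", so the score is (1 + 5/6) / 2, roughly 0.917, rather than 1. This is why the threshold has to sit below 1 even for pairs that match perfectly.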
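Does this “” structure always hold, or is that just your example? If it does, you could strip it with a simple regex and do an exact match.
Hi, yes, I know regex, but no, it varies a lot because there are many different websites :/
Then you should probably make your example more representative, because right now it is very easy to provide a solution for the example you gave.
@Davidernburg totally agree, thanks for the feedback, I edited it.
Hey, sorry for my late reply! Thanks! I tried your function, but what I get back is 'Error in strsplit(dataf$url, "/") : non-character argument', so I am not sure what I am missing...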
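The "non-character argument" error reported in the last comment is what strsplit() raises when it is given a factor instead of a character vector. That commonly happens when a data frame was created or read with stringsAsFactors = TRUE, which was the default before R 4.0. Whether that is the cause here is an assumption, since the commenter's data is not shown. The sketch below reproduces the error on a hypothetical one-row `dataf` and shows one possible fix.

```r
# Hypothetical data frame whose columns were created as factors
dataf <- data.frame(
  title = "What to cook in summer",
  urls  = "www.newspaper/mynews/what-to-cook-in-summer.com",
  stringsAsFactors = TRUE
)

# strsplit() rejects factors; capture the error message instead of stopping
res <- tryCatch(strsplit(dataf$urls, "/"),
                error = function(e) conditionMessage(e))
print(res)

# Converting the columns to character makes them usable with strsplit()
dataf$title <- as.character(dataf$title)
dataf$urls  <- as.character(dataf$urls)
print(strsplit(dataf$urls, "/"))
```

After the conversion, the columns can be passed to pairer() as in the answer.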
Result of the code above (df with the new guess column):
title | urls | guess
Who will be the next president? | https://website/5-ways-to-make-a-cocktail.com | https://website/who-will-be-the-next-president.com
5 ways to make a cocktail | https://website/who-will-be-the-next-president.com | https://website/5-ways-to-make-a-cocktail.com
2 millions raised by this startup | https://website/how-did-you-find-your-house.com | https://website/2-millions-raised-by-this-startup.com
How did you find your house | https://website/2-millions-raised-by-this-startup.com | https://website/how-did-you-find-your-house.com
How did you find your house | https://washingtonpost/article/latest-movies-in-theater.com | https://website/how-did-you-find-your-house.com
Latest movies in Theater | www.newspaper/mynews/what-to-cook-in-summer.com | https://washingtonpost/article/latest-movies-in-theater.com
What to cook in summer | https://website/2-millions-raised-by-this-startup.com | www.newspaper/mynews/what-to-cook-in-summer.com
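Since the question also asks how stringdistmatrix could be used here, the following is one possible sketch and not part of the answer above. It assumes the stringdist package is installed, reduces each URL to its last path segment, and picks for every title the URL with the smallest Levenshtein distance. The clean-up rules (dropping a trailing ".com", replacing hyphens with spaces) and the cut-off `max_dist` are assumptions that fit the example data only.

```r
library(stringdist)

titles <- c("Who will be the next president?", "5 ways to make a cocktail",
            "2 millions raised by this startup", "How did you find your house",
            "How did you find your house", "Latest movies in Theater",
            "What to cook in summer")
urls <- c("https://website/5-ways-to-make-a-cocktail.com",
          "https://website/who-will-be-the-next-president.com",
          "https://website/how-did-you-find-your-house.com",
          "https://website/2-millions-raised-by-this-startup.com",
          "https://washingtonpost/article/latest-movies-in-theater.com",
          "www.newspaper/mynews/what-to-cook-in-summer.com",
          "https://website/2-millions-raised-by-this-startup.com")

# Normalise titles: lower case, punctuation removed
clean_titles <- gsub("[[:punct:]]", "", tolower(titles))

# Normalise URLs: last path segment, no ".com", hyphens -> spaces
slugs <- sub(".*/", "", urls)
slugs <- gsub("-", " ", sub("\\.com$", "", tolower(slugs)))

# Rows are titles, columns are URLs; entries are Levenshtein distances
d <- stringdistmatrix(clean_titles, slugs, method = "lv")

best     <- apply(d, 1, which.min)  # closest URL per title (first on ties)
best_d   <- apply(d, 1, min)        # its distance
max_dist <- 5                       # hypothetical cut-off, tune for real data
guess    <- ifelse(best_d <= max_dist, urls[best], NA)

print(data.frame(title = titles, guess = guess, dist = best_d))
```

On real data with many different sites, the normalisation step would need to be adapted, and a length-normalised method such as method = "jw" may be easier to threshold than a raw edit distance.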