Fuzzy string matching in an R data frame

Tags: r, fuzzy-logic, stringdist, record-linkage

I have a data frame with article titles and their related URL links.

My problem is that the URL links are not in the rows of their corresponding titles, for example:

               title                  |                     urls
    Who will be the next president?   | https://website/5-ways-to-make-a-cocktail.com 
    5 ways to make a cocktail         | https://website/who-will-be-the-next-president.com
    2 millions raised by this startup | https://website/how-did-you-find-your-house.com 
    How did you find your house       | https://website/2-millions-raised-by-this-startup.com
    How did you find your house       | https://washingtonpost/article/latest-movies-in-theater.com
    Latest movies in Theater          | www.newspaper/mynews/what-to-cook-in-summer.com
    What to cook in summer            | https://website/2-millions-raised-by-this-startup.com
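
For reference, the example above can be rebuilt as a small data frame like this (a minimal sketch; using stringsAsFactors = FALSE is my assumption, so both columns stay plain character vectors):

    df <- data.frame(
      title = c("Who will be the next president?",
                "5 ways to make a cocktail",
                "2 millions raised by this startup",
                "How did you find your house",
                "How did you find your house",
                "Latest movies in Theater",
                "What to cook in summer"),
      urls = c("https://website/5-ways-to-make-a-cocktail.com",
               "https://website/who-will-be-the-next-president.com",
               "https://website/how-did-you-find-your-house.com",
               "https://website/2-millions-raised-by-this-startup.com",
               "https://washingtonpost/article/latest-movies-in-theater.com",
               "www.newspaper/mynews/what-to-cook-in-summer.com",
               "https://website/2-millions-raised-by-this-startup.com"),
      stringsAsFactors = FALSE
    )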
My guess is that I need some kind of fuzzy matching logic here, but I am not sure how to proceed. For the duplicates I will simply use the unique function, as sketched below.
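
A minimal sketch of that deduplication step, assuming the data frame is called df as above:

    # rows that are exact copies (same title and same url) collapse to one
    df <- unique(df)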

I started with the levenshteinSim function from the RecordLinkage package, which gives a similarity score for each row, but obviously the scores are low everywhere since the rows do not match.
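
For illustration, this is roughly what that row-wise comparison looks like (a sketch using the df built above; levenshteinSim returns a similarity between 0 and 1, and the values stay low here because the URLs are shuffled relative to their titles):

    library(RecordLinkage)

    # compare each title against the URL sitting in the same row
    levenshteinSim(tolower(df$title), tolower(df$urls))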


I have also heard about the stringdistmatrix function from the stringdist package, but I am not sure how to use it here.
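
One way stringdistmatrix could be used here, as a rough sketch rather than a definitive answer: strip each URL down to its readable slug, compute a title-by-slug distance matrix, and pick the closest URL per title. The slug extraction via basename() and the cosine q-gram distance are assumptions on my part:

    library(stringdist)

    # reduce each URL to its slug, e.g. "5 ways to make a cocktail com"
    slugs <- gsub("[-.]", " ", basename(df$urls))

    # titles x slugs distance matrix; smaller values mean more similar strings
    m <- stringdistmatrix(tolower(df$title), tolower(slugs), method = "cosine", q = 2)

    # for every title, keep the URL whose slug is closest
    df$closest_url <- df$urls[apply(m, 1, which.min)]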

This can certainly be optimized, but it might get you started:

  • The function matcher() compares two strings and produces a score
  • We then try to match each title with matcher() and keep the highest score
  • If no score above the threshold is found, NA is produced
  • In R:

    matcher <- function(needle, haystack) {
      ### Analyzes the url part, converts them to lower case words
      ### and calculates a score to return
    
      # convert url
      y <- unlist(strsplit(haystack, '/'))
      y <- tolower(unlist(strsplit(y[length(y)], '[-.]')))
    
      # convert needle
      x <- needle
    
      # sum it up
      (z <- (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2)
    }
    
    pairer <- function(titles, urls, threshold = 0.75) {
      ### Calculates scores for each title -> url combination
      result <- vector(length = length(titles))
      for (i in seq_along(titles)) {
        needle <- tolower(unlist(strsplit(titles[i], ' ')))
        scores <- unlist(lapply(urls, function(url) matcher(needle, url)))
        high_score <- max(scores)
    
        # above threshold ?
        result[i] <- ifelse(high_score >= threshold, 
                            urls[which(scores == high_score)], NA)
      }
      return(result)
    }
    
    df$guess <- pairer(df$title, df$urls)
    df
    

    Output:

                                  title                                                        urls                                                       guess
    1   Who will be the next president?               https://website/5-ways-to-make-a-cocktail.com          https://website/who-will-be-the-next-president.com
    2         5 ways to make a cocktail          https://website/who-will-be-the-next-president.com               https://website/5-ways-to-make-a-cocktail.com
    3 2 millions raised by this startup             https://website/how-did-you-find-your-house.com       https://website/2-millions-raised-by-this-startup.com
    4       How did you find your house       https://website/2-millions-raised-by-this-startup.com             https://website/how-did-you-find-your-house.com
    5       How did you find your house https://washingtonpost/article/latest-movies-in-theater.com             https://website/how-did-you-find-your-house.com
    6          Latest movies in Theater             www.newspaper/mynews/what-to-cook-in-summer.com https://washingtonpost/article/latest-movies-in-theater.com
    7            What to cook in summer       https://website/2-millions-raised-by-this-startup.com             www.newspaper/mynews/what-to-cook-in-summer.com

Comments:

  • Does this structure of "…" always hold, or is that just your example? If so, you could strip it with a simple regex and do an exact match.
  • Hi, yes, I know about regex, but no, it varies a lot because there are many different websites :/
  • You should probably make your example more representative, because right now it is very easy to provide a solution for the example you gave.
  • @DavidArenburg totally agree, thanks for the feedback, I have edited it.
  • Hey, sorry for my late reply! Thanks! I tried your function but I get "Error in strsplit(dataf$url, "/"): non-character argument", so I am not sure what I am missing...
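
Regarding the strsplit error in the last comment: that message usually means the urls column is a factor rather than a character vector (the default behaviour of data.frame with stringsAsFactors = TRUE in older R versions), so converting the columns first is worth trying. This is a guess, since the actual data is not shown:

    # convert factor columns to plain character vectors before calling pairer()
    df$title <- as.character(df$title)
    df$urls  <- as.character(df$urls)

    df$guess <- pairer(df$title, df$urls)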