
Fuzzy matching movie titles in R without loops and extracting equivalent titles by release date


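I am trying to merge two datasets with fuzzy string matching on the title column, which holds the movie names. Samples of both datasets are given below.

The first dataset looks like this: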

  itemid userid rating       time                              title release_date
99995    1677    854      3 1997-12-22                      sweet nothing         1995
99996    1678    863      1 1998-03-07                         mat' i syn         1997
99997    1679    863      3 1998-03-07                          b. monkey         1998
99998    1429    863      2 1998-03-07                      sliding doors         1998
99999    1681    896      3 1998-02-11                       you so crazy         1994
100000   1682    916      3 1997-11-29 scream of stone (schrei aus stein)         1991

The second one is:

itemid userid rating       time                     title release_date
117201 3175936   9140      3 2013-09-22 bei tou zou de na wu nian         2013
117202 3175936  17439      3 2013-09-18 bei tou zou de na wu nian         2013
117203 3181128   3024      5 2013-09-13                mac & jack         2013
117204 3181962  17310      5 2013-09-19         the last shepherd         2013
117205 3188690  13551      5 2013-09-17     the making of a queen         2013
117206 3198468   5338      3 2013-09-22          north 24 kaatham         2013

dput-df1

structure(list(itemid = c(1677L, 1678L, 1679L, 1429L, 1681L, 
1682L), userid = c(854L, 863L, 863L, 863L, 896L, 916L), rating = c(3L, 
1L, 3L, 2L, 3L, 3L), time = structure(c(10217, 10292, 10292, 
10292, 10268, 10194), class = "Date"), title = c("sweet nothing", 
"mat' i syn", "b. monkey", "sliding doors", "you so crazy", "scream of stone (schrei aus stein)"
), release_date = c("1995", "1997", "1998", "1998", "1994", "1991"
)), .Names = c("itemid", "userid", "rating", "time", "title", 
"release_date"), row.names = 99995:100000, class = "data.frame")

dput-df2

structure(list(itemid = c(3175936L, 3175936L, 3181128L, 3181962L, 
3188690L, 3198468L), userid = c(9140L, 17439L, 3024L, 17310L, 
13551L, 5338L), rating = c(3, 3, 5, 5, 5, 3), time = structure(c(15970, 
15966, 15961, 15967, 15965, 15970), class = "Date"), title = c("bei tou zou de na wu nian", 
"bei tou zou de na wu nian", "mac & jack", "the last shepherd", 
"the making of a queen", "north 24 kaatham"), release_date = c("2013", 
"2013", "2013", "2013", "2013", "2013")), .Names = c("itemid", 
"userid", "rating", "time", "title", "release_date"), row.names = 117201:117206, class = "data.frame")

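I want to fuzzy match the titles in the two datasets using levenshteinSim and, for titles with a similarity above 0.85 for example, extract that movie's information from both datasets into a new dataset. I also need to check that the matched titles have the same release date, because movies with the same name can have several release dates.

Can someone guide me on how to accomplish this?

So far I have tried the following code: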

df <- sapply(df1$title, levenshteinSim, df2$title)

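and then looked for values of df above 0.85, but this does not look like an efficient approach. Also, I cannot match on the release date with this code.

You can merge the two data frames on release_date: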
z <- merge(df1,df2,by='release_date',suffixes=c('.df1','.df2'))

Once the similarity between title.df1 and title.df2 is stored in z$L.dist, you can filter the rows you need:
subset(z,L.dist > 0.85)
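
The steps above can be sketched end to end in base R. The sample data in the question contains no matching titles, so the two small data frames below are made up for illustration. `lev_sim` is a hypothetical helper built on base R's `adist`, used here as a stand-in for `levenshteinSim`; it assumes the similarity is defined as 1 minus the edit distance divided by the length of the longer string.

```r
# Hypothetical stand-in for levenshteinSim, using base R only:
# similarity = 1 - edit distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Made-up toy data (the sample in the question has no matches).
a <- data.frame(title = c("sliding doors", "b. monkey", "you so crazy"),
                rating = c(2, 3, 3),
                release_date = c("1998", "1998", "1994"),
                stringsAsFactors = FALSE)
b <- data.frame(title = c("sliding door", "b monkey", "sliding doors"),
                rating = c(4, 5, 1),
                release_date = c("1998", "1998", "2013"),
                stringsAsFactors = FALSE)

# Pair up only rows that share a release year, then score each pair.
z <- merge(a, b, by = "release_date", suffixes = c(".df1", ".df2"))
z$L.dist <- lev_sim(z$title.df1, z$title.df2)

# Keep the pairs whose titles are similar enough.
matches <- subset(z, L.dist > 0.85)
print(matches)
```

Because the merge happens on release_date first, the exact title "sliding doors" dated 2013 in the second toy table is never compared with the 1998 film, which is the behaviour asked for in the question.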

Update

Here is a similar approach using data.table, which may be a faster option:
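library(data.table)
library(RecordLinkage)  # provides levenshteinSim

d1 <- as.data.table(df1)
d2 <- as.data.table(df2)
setkey(d1, release_date)
setkey(d2, release_date)

# Inner join on release_date; clashing columns from d2 get the prefix "i."
z <- d1[d2, allow.cartesian = TRUE, nomatch = 0L]

# Similarity between the two title columns, row by row
#z[, L.dist := levenshteinSim(title, i.title)]
z[, L.dist := mapply(levenshteinSim, title, i.title)]

# Keep the pairs above the similarity threshold
z[L.dist > 0.8]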

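Can you add the dput(df) output of the datasets to the question?

Done. The sample data has no matching movies, but the real datasets do. For example, if the Levenshtein similarity is > 0.80 and the release dates are the same, extract the information into a new DF.

Your approach is completely different, and very nice. Can it be applied to datasets with more than 100,000 rows each?

I was about to ask you the same thing! :) I think the algorithm itself should be fairly efficient, so I suggest giving it a try.

I tried it and got this error: "Error: cannot allocate vector of size 266.4 Mb. In addition: Warning messages: 1: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, seq_len(ncx)[-by.x])) : Reached total allocation of 5942Mb: see help(memory.size)". Warnings 2 and 3 repeat the same message.

Can I reduce the dimensionality of the datasets? I do not need all the columns in the result, only the movies and the ratings, and possibly the release date, although that is not essential.

Yes, that is a good idea. Try something like df1.fc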
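
One way to act on the suggestion of shrinking the data before the join is sketched below in base R. It rests on assumptions that are not in the thread: only title, rating, and release_date are needed, and the same film appears in many rating rows, so each distinct (title, release_date) pair is scored once instead of once per rating. The data frames are made up, and `lev_sim` is a hypothetical `adist`-based stand-in for `levenshteinSim`.

```r
# Hypothetical stand-in for levenshteinSim, using base R only:
# similarity = 1 - edit distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Made-up rating tables with several rows per film.
r1 <- data.frame(title = c("sliding doors", "sliding doors", "b. monkey"),
                 rating = c(2, 4, 3),
                 release_date = c("1998", "1998", "1998"),
                 stringsAsFactors = FALSE)
r2 <- data.frame(title = c("sliding door", "sliding door", "mac & jack"),
                 rating = c(5, 3, 5),
                 release_date = c("1998", "1998", "2013"),
                 stringsAsFactors = FALSE)

# 1. Score each distinct (title, release_date) pair once.
u1 <- unique(r1[, c("title", "release_date")])
u2 <- unique(r2[, c("title", "release_date")])
pairs <- merge(u1, u2, by = "release_date", suffixes = c(".df1", ".df2"))
pairs$L.dist <- lev_sim(pairs$title.df1, pairs$title.df2)
pairs <- subset(pairs, L.dist > 0.85)

# 2. Attach the ratings only for the films that matched.
out <- merge(pairs, r1,
             by.x = c("title.df1", "release_date"),
             by.y = c("title", "release_date"))
out <- merge(out, r2,
             by.x = c("title.df2", "release_date"),
             by.y = c("title", "release_date"),
             suffixes = c(".df1", ".df2"))
print(out)
```

The expensive string comparison then runs on the deduplicated title tables, and the full rating rows are joined back only for matched films. Whether this is enough to avoid the allocation error depends on how many distinct titles share a release year in the real data.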