R: Speeding up Levenshtein matching code

r for-loop optimization

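I have two data tables containing city names. The first, mydf, contains the list of cities we want to check. It consists of 18990 records. The second, myref, is a reference table with 353766 rows.

Both tables share the same structure; the code below uses their country and city columns.

Table myref can contain cities that do not exist in mydf, as it is our reference table.
Table mydf can contain cities that do not exist in myref, as we are trying to identify what is missing from our reference table.
Table mydf may contain cities that are spelled slightly differently from the cities in myref.

The cities come from many countries, so I implemented the following logic using for loops:

1: Loop through each country in mydf.
2: Loop through each city in mydf for the country from loop 1 and do a Levenshtein match against the same country in the reference table (different countries can have similarly named towns, which is the reason for the first loop).
3: Record the match percentage and the most similar town in the mydf table, for manual checking after the code has run.

I tried creating a second column in the mydf table containing all possible matches for a city within its country and then running the Levenshtein match, but I ran out of memory (the table is too large, and I am running on a 32-bit Windows laptop). That is why I went back to the for loop, which takes days to run. Can anyone help? My code is below (I know a for loop is probably the least optimal way of doing this, so this should be a good learning experience). Let me know if you need more information.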

library(dplyr)          # %>%, filter, mutate
library(RecordLinkage)  # levenshteinSim

# Initialize variables; my.country contains all the unique countries in mydf
my.country <- unique(mydf$country)
mydf.final <- mydf[0, ] %>%
  mutate(levdist = numeric(0),
         town.match = character(0))

# For each country, take each item in mydf and compare it to every record
# of the same country in the reference table myref, then store the best
# Levenshtein similarity and the matched city in mydf
for (intcountry in seq_along(my.country))
{
  # Filter mydf and myref to the country of this iteration
  mydf.sample <- mydf %>%
    filter(country == my.country[intcountry])

  # Fixed: the reference sample must be taken from myref, not mydf
  myref.sample <- myref %>%
    filter(country == my.country[intcountry])

  # Result columns; they stay NA when the country has no reference rows
  mydf.sample$levdist <- NA_real_
  mydf.sample$town.match <- NA_character_

  # Temp vector: one numeric similarity score per reference row
  vector <- numeric(nrow(myref.sample))

  # For every record in the data frame to be checked
  for (item in seq_len(nrow(mydf.sample)))
  {
    # For every record in the reference table (same country only)
    for (k in seq_len(nrow(myref.sample)))
    {
      vector[k] <- levenshteinSim(mydf.sample$city[item], myref.sample$city[k])
    }
    # Get index of the highest Levenshtein match
    if (length(vector) > 0)
    {
      max.match.index <- which.max(vector)
      mydf.sample$levdist[item] <- vector[max.match.index]
      mydf.sample$town.match[item] <- myref.sample$city[max.match.index]
    }
  }
  mydf.final <- rbind(mydf.sample, mydf.final)
}

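When speed and memory are an issue, the data.table package is usually a good choice. As you did not provide example data to illustrate the problem, I created some (see the end of this answer for the dput output of the data used).
1: First, you have to convert your data frames to data tables: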
library(data.table)
setDT(mydf, key=c("country","city"))
setDT(myref, key=c("country","city"))

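With the key = c("country", "city") part you also set a key (country, then city) on each data table, which the join in the next step relies on.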
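
If setDT() in the installed data.table version rejects the key argument, the same keyed table can be built in two steps with setkey(). The following is a minimal sketch on made-up toy data, not part of the original answer:

```r
library(data.table)

# Toy data, assumed for illustration only
mydf <- data.frame(country = c("LT", "GB", "LT", "LT"),
                   city = c("VILNIUS", "LONDON", "KAUNAS", "VILNUS"),
                   stringsAsFactors = FALSE)

setDT(mydf)                   # convert to a data.table by reference
setkey(mydf, country, city)   # sort by country, then city, and mark both as the key

key(mydf)                     # shows the key columns now set on mydf
```

Setting the key also reorders the rows, so the table ends up sorted by country and then by city.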
2: Now you can easily remove the entries in mydf that are also in the reference dataset myref with:
mydf <- mydf[!myref]

The record (row) for "VILNIUS" is removed from mydf, but the record for "VILNUS" is not, because it is not an exact match.
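
The same exact-match filtering can also be written without keys by naming the join columns explicitly. The sketch below assumes a data.table version that supports the on argument and uses toy data; the names unmatched and matched are made up for illustration:

```r
library(data.table)

# Toy data, assumed for illustration only
mydf <- data.table(country = c("LT", "GB", "LT", "LT"),
                   city = c("VILNIUS", "LONDON", "KAUNAS", "VILNUS"))
myref <- data.table(country = c("LT", "NL"),
                    city = c("VILNIUS", "AMSTERDAM"))

# Anti-join: rows of mydf with no exact (country, city) match in myref
unmatched <- mydf[!myref, on = c("country", "city")]

# Inner join: rows of mydf that do have an exact match in myref
matched <- mydf[myref, on = c("country", "city"), nomatch = 0L]
```

Keeping the matched rows as well makes it easy to check how many cities were resolved exactly before any fuzzy matching is attempted on the remainder.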
EDIT: I removed the third and fourth options, as they did not seem to work properly.

Used data:
mydf <- structure(list(country = c("LT","GB","LT","LT"), city = c("VILNIUS","LONDON","KAUNAS","VILNUS")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
myref <- structure(list(country = c("LT","NL"), city = c("VILNIUS","AMSTERDAM")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L))

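You could try n-grams. Levenshtein is an essential element of suffix arrays/trees, which are hard to implement.

Please read through this post on how to speed up for loops. Also, some data.frames are not in the example, which makes it hard to check your code.

Hi @Jaap, I tried to set the key using your example, setDT(mydf, key=c("country","city")). I got the error: Error in setDT(mydf, key = c("country", "city")) : unused argument (key = c("country", "city"))

@JohnSmith Strange. It works on my system. Which version of data.table are you using?

Hi @Jaap, it says data.table_1.9.4

@JohnSmith That is an older version of data.table. You have to update to v1.9.6.

Hi @Jaap, sorry for the delay; to update, I first had to get the proxy settings working. Anyway, using your second method, can we specify a similarity, so towns that are 80% similar? Also, is it possible to see what they were matched against in the reference table, so that an analyst can confirm manually?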
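
The follow-up question above (a similarity cut-off such as 80%, plus the matched reference town for manual confirmation) could be approached roughly as sketched below. This is not part of the original answer. It uses base R's adist() for the Levenshtein distance; the helper name best_match, the column levsim, and the normalisation 1 - distance / longest string length are assumptions made for illustration, and the toy data mirror the example tables.

```r
# Hypothetical helper: best reference match for each city within one country.
# adist() (base R, package utils) returns a matrix of Levenshtein distances.
best_match <- function(cities, refs) {
  d <- adist(cities, refs)                            # length(cities) x length(refs)
  longest <- outer(nchar(cities), nchar(refs), pmax)  # longer string of each pair
  sim <- 1 - d / longest                              # assumed similarity, 1 = identical
  idx <- max.col(sim, ties.method = "first")          # best reference column per row
  data.frame(city = cities,
             town.match = refs[idx],
             levsim = sim[cbind(seq_along(cities), idx)],
             stringsAsFactors = FALSE)
}

# Toy data, assumed for illustration only
mydf <- data.frame(country = c("GB", "LT", "LT"),
                   city = c("LONDON", "KAUNAS", "VILNUS"),
                   stringsAsFactors = FALSE)
myref <- data.frame(country = c("LT", "NL"),
                    city = c("VILNIUS", "AMSTERDAM"),
                    stringsAsFactors = FALSE)

# Work country by country so only one distance matrix is in memory at a time
result <- do.call(rbind, lapply(unique(mydf$country), function(cn) {
  cities <- mydf$city[mydf$country == cn]
  refs <- myref$city[myref$country == cn]
  if (length(refs) == 0) {
    # No reference towns for this country: nothing to match against
    return(data.frame(country = cn, city = cities,
                      town.match = NA_character_, levsim = NA_real_,
                      stringsAsFactors = FALSE))
  }
  cbind(country = cn, best_match(cities, refs), stringsAsFactors = FALSE)
}))

# Rows at or above an 80% similarity threshold, for manual confirmation
likely <- result[!is.na(result$levsim) & result$levsim >= 0.8, ]
```

On the toy data only VILNUS clears the 0.8 threshold, matched to VILNIUS. Because adist() builds a full distance matrix per country, the cities of a very large country may need to be processed in chunks to stay within memory.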
Output of mydf after step 2:
> mydf
   country   city
1:      GB LONDON
2:      LT KAUNAS
3:      LT VILNUS
