SparkR: Levenshtein fuzzy string matching between 2 variables in 2 Spark DataFrames


I have two Spark DataFrames:

library(SparkR); library(magrittr)

df1 <- createDataFrame(data.frame(var1 = c("rat", "cat", "bat")))
df2 <- createDataFrame(data.frame(var2 = c("cat3", "bat1", "dog", "toy")))

Please advise how to fuzzy-match var1 against var2 using the Levenshtein distance.

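Your error comes from referencing a column of a table that is not part of the execution plan: a column of df2 cannot be used in an expression on df1 until the two DataFrames have been joined.

Adding a crossJoin fixes this: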

dist_df <- df1 %>%
  crossJoin(df2) %>% 
  withColumn("dist", levenshtein(df1$var1, df2$var2)) 
dist_df %>% head()
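
As a sanity check on what the cross join plus levenshtein should produce, the same pairwise distances can be computed locally without Spark. This is only a sketch in base R, not part of the SparkR solution: it assumes the toy data from the question, uses expand.grid() as a stand-in for crossJoin(), and uses utils::adist() (generalized Levenshtein distance with default unit costs) as a stand-in for Spark's levenshtein(). The name pairs is made up for illustration.

```r
# Local cross-check in base R (no Spark session required).
var1 <- c("rat", "cat", "bat")
var2 <- c("cat3", "bat1", "dog", "toy")

# expand.grid() enumerates every (var1, var2) pair, with var1 varying fastest.
pairs <- expand.grid(var1 = var1, var2 = var2, stringsAsFactors = FALSE)

# adist(var1, var2) is a 3 x 4 matrix (rows = var1, columns = var2).
# as.vector() flattens it column by column, which matches the row order of pairs.
pairs$dist <- as.vector(adist(var1, var2))

print(pairs)
```

Each row of pairs should agree with the corresponding row of dist_df, although Spark does not guarantee the same row order.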
Starting from dist_df, you can use standard methods to find the closest match, for example:

best_matches <- dist_df %>% 
  groupBy("var2") %>% 
  agg(struct(dist_df$dist, dist_df$var1) %>% min() %>% alias("match"))

threshold <- 1  # Maximum match distance to keep

result <- best_matches %>% 
  select(
    best_matches$var2, 
    when(best_matches$match.dist <= threshold, best_matches$match.var1) %>% 
      alias("var1"))

result %>% head()
With threshold <- 1, any var2 value that has no var1 within distance 1 (here "dog" and "toy") comes back as NA in var1. To see the nearest match regardless of distance, select the match without the threshold:

best_matches %>% select(best_matches$var2, best_matches$match.var1) %>% head()
  var2 var1
1  dog  bat
2 bat1  bat
3 cat3  cat
4  toy  bat
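
The min-over-struct step and the threshold can also be reproduced locally to see why ties resolve the way they do. The following is a base R sketch under assumptions, not the SparkR code itself: it reuses the toy data from the question, uses utils::adist() in place of levenshtein(), and mimics Spark's struct ordering (compare dist first, then var1) with order(). The names d, best and nearest are made up for illustration.

```r
# Local emulation of "min(struct(dist, var1)) per var2" plus the threshold.
var1 <- c("rat", "cat", "bat")
var2 <- c("cat3", "bat1", "dog", "toy")
threshold <- 1  # Maximum match distance to keep

# Distance matrix with rows = var2 and columns = var1.
d <- adist(var2, var1)

best <- do.call(rbind, lapply(seq_along(var2), function(i) {
  # Smallest distance first; ties go to the alphabetically smallest var1,
  # which is how a minimum over (dist, var1) pairs behaves.
  o <- order(d[i, ], var1)[1]
  data.frame(var2 = var2[i], nearest = var1[o], dist = d[i, o],
             stringsAsFactors = FALSE)
}))

# Keep the match only when it is within the threshold, otherwise NA.
best$var1 <- ifelse(best$dist <= threshold, best$nearest, NA)

print(best)
```

Here "dog" and "toy" are at distance 3 from every candidate, so the tie-break picks "bat" as nearest and the threshold then turns the kept match into NA.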