SparkR: Levenshtein fuzzy string matching between 2 variables in 2 Spark DataFrames
I have two Spark DataFrames:
library(SparkR); library(magrittr)
df1 <- createDataFrame(data.frame(var1 = c("rat", "cat", "bat")))
df2 <- createDataFrame(data.frame(var2 = c("cat3", "bat1", "dog", "toy")))
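Please advise.
The error occurs because you reference columns from a table that is not part of the execution plan. Adding a cross join resolves it: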
dist_df <- df1 %>%
  crossJoin(df2) %>%
  withColumn("dist", levenshtein(df1$var1, df2$var2))
dist_df %>% head()
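For small inputs, the distances can be sanity-checked locally without Spark. The following is a minimal base R sketch, not part of the original answer: it assumes the data fits in memory, uses `merge(..., by = NULL)` as a stand-in for `crossJoin()`, and uses `utils::adist()`, whose default costs give the plain Levenshtein distance. The name `pairs` is illustrative.

```r
# Local check of the pairwise distances (no Spark required).
var1 <- c("rat", "cat", "bat")
var2 <- c("cat3", "bat1", "dog", "toy")

# merge() with by = NULL returns the Cartesian product, like crossJoin().
pairs <- merge(
  data.frame(var1 = var1, stringsAsFactors = FALSE),
  data.frame(var2 = var2, stringsAsFactors = FALSE),
  by = NULL
)

# adist() with default costs (insert = delete = substitute = 1)
# is the Levenshtein distance; take the single cell for each pair.
pairs$dist <- mapply(
  function(a, b) adist(a, b)[1, 1],
  pairs$var1, pairs$var2,
  USE.NAMES = FALSE
)

print(pairs)
```

Each of the 3 x 4 = 12 pairs gets one row, which is also why the cross join becomes expensive for large tables.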
From here, you can find the closest match for each value using standard methods, for example:
best_matches <- dist_df %>%
  groupBy("var2") %>%
  agg(struct(dist_df$dist, dist_df$var1) %>% min() %>% alias("match"))
threshold <- 1  # Maximum match distance to keep
result <- best_matches %>%
  select(
    best_matches$var2,
    when(best_matches$match.dist <= threshold, best_matches$match.var1) %>%
      alias("var1"))
result %>% head()
Rows whose closest match is farther away than threshold get NA in var1. To keep the closest match for every row regardless of distance, select the struct field directly:
best_matches %>% select(best_matches$var2, best_matches$match.var1) %>% head()
  var2 var1
1  dog  bat
2 bat1  bat
3 cat3  cat
4  toy  bat
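The `min()` over `struct(dist, var1)` works because structs are compared field by field: the smallest distance wins, and ties are broken by the alphabetically first `var1`. A minimal local sketch of the same selection in base R follows. It is an illustration under the assumption that the data fits in memory, not the SparkR implementation, and the names `pairs`, `ordered`, and `best` are hypothetical.

```r
# Local illustration of "minimum of (dist, var1) per var2" plus a threshold.
var1 <- c("rat", "cat", "bat")
var2 <- c("cat3", "bat1", "dog", "toy")

# All (var1, var2) pairs with their Levenshtein distance.
pairs <- merge(
  data.frame(var1 = var1, stringsAsFactors = FALSE),
  data.frame(var2 = var2, stringsAsFactors = FALSE),
  by = NULL
)
pairs$dist <- mapply(
  function(a, b) adist(a, b)[1, 1],
  pairs$var1, pairs$var2,
  USE.NAMES = FALSE
)

# Sort by var2, then dist, then var1; the first row in each var2 group
# is the same pair that min(struct(dist, var1)) would select.
ordered <- pairs[order(pairs$var2, pairs$dist, pairs$var1), ]
best <- ordered[!duplicated(ordered$var2), ]

# Keep the match only when it is within the threshold, otherwise NA.
threshold <- 1
best$match <- ifelse(best$dist <= threshold, best$var1, NA_character_)

print(best[, c("var2", "var1", "dist", "match")])
```

Here "dog" and "toy" are at distance 3 from every candidate, so the tie-break picks "bat" and the threshold then turns the match into NA.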