R: replacing strings using stringdist or levenshtein.distance


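I have a large dataset of roughly one million observations, each with a defined observation type. About 900,000 of the observations have a malformed observation type: there are roughly 850 (incorrect) variations of the 50 acceptable observation types.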

keys <- c("DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")

entries <- c("Day", "day", "SUNSET/DUSK", "DAYS", "dayy", "EVEN", "Evening", "early dusk", "late day", "nite", "red dawn", "Evening Sunset", "mid-night", "midnight", "midnite","DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")
You could try:

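library(stringdist)
# Levenshtein distance between every entry (rows) and every key (columns)
m <- stringdistmatrix(entries, keys, method = "lv")
# For each entry, pick the key with the smallest distance
a <- keys[apply(m, 1, which.min)]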
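Alternatively, base R's adist() can build the distance matrix, with the closest key selected in the same way (giving b). From the documentation:

Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.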
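
The adist() route could be sketched as follows. This is a minimal sketch rather than the answerer's exact code: it assumes the keys and entries vectors from the question, and the names d and b are chosen here only to line up with the comparison against a.

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT",
          "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")

entries <- c("Day", "day", "SUNSET/DUSK", "DAYS", "dayy", "EVEN", "Evening",
             "early dusk", "late day", "nite", "red dawn", "Evening Sunset",
             "mid-night", "midnight", "midnite", "DAY", "EVENING", "SUNSET",
             "DUSK", "NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")

# adist() ships with base R (package utils); rows are entries, columns are keys
d <- adist(entries, keys)

# For each entry, take the key with the smallest edit distance
# (which.min returns the first minimum, so ties go to the earlier key)
b <- keys[apply(d, 1, which.min)]
```

Note that adist() compares case-sensitively by default; its ignore.case = TRUE argument may be worth trying for entries such as "day" or "Evening".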

Both methods give the same result:

> identical(a, b)
#[1] TRUE

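You may want to look at the documentation for adist().

When comparing "SUNSET" and "DUSK" against "SUNSET/DUSK", you need to specify which one you consider the correct match; with the distance approach, "SUNSET/DUSK" resolves to "SUNSET".

The nature of the dataset doesn't let me tell whether "DUSK" or "SUNSET" is the better fit. I cheered so loudly I scared the dog! Thank you both very much!

adist is exactly what I was looking for! Big smile. Thank you.

Beautiful, elegant solution! Thanks.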
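
On the "SUNSET/DUSK" point raised in the comments, the individual distances can be inspected directly, and an explicit override applied where the nearest key is not the one wanted. A rough sketch under stated assumptions: the overrides vector, its choice of "DUSK", and the match_key helper are all hypothetical and not part of the original answer.

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT",
          "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")

# Edit distances from the ambiguous entry to its two candidate keys
d <- adist("SUNSET/DUSK", c("SUNSET", "DUSK"))
# d[1, 1] is 5 (drop "/DUSK"), d[1, 2] is 7 (drop "SUNSET/")

# Hypothetical manual overrides, consulted before falling back to the nearest key
overrides <- c("SUNSET/DUSK" = "DUSK")

match_key <- function(x, keys, overrides) {
  nearest <- keys[apply(adist(x, keys), 1, which.min)]
  forced <- unname(overrides[x])  # NA where no override is defined
  ifelse(is.na(forced), nearest, forced)
}

match_key(c("SUNSET/DUSK", "DAYS"), keys, overrides)
```

Keeping the overrides in a small named vector leaves the distance-based matching untouched for everything else, so only the entries you have decided by hand are forced.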