R: Expanding a function that takes a data.table as an argument to use the full table (instead of a subset)
I have a function that works on a one-row data.table (data.frame) but not on a full data.table. I would like to extend the function so that it takes all rows of the input data.table into account. The gist of the problem is as follows: part of a character field in one data.table (tryshort3) needs to be replaced with another string taken from a second data.table (mapping). An MRE follows:
#this is the original data.table
tryshort3 <- structure(list(country = c("AT", "AT", "MT", "DE", "CH", "XK"
), name = c("ASDF AG", "ASDF GMBH", "ASDF DF", "ASDF KG", "ASDF SA",
"ASDF DAF"), address = c("ACDSTR. 3", "ACDSTR. 4", "ACDSTR. 5",
"ACDSTR. 6", "ACDSTR. 7", "ACDSTR. 8")), .Names = c("country",
"name", "address"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"))
#this is the "mapping" table
mapping <- structure(list(country = c("AT", "AT", "DE", "DE", "HU"), short.form = c("AG",
"GMBH", "GMBH", "EV", "EV"), long.form = c("AKTIENGESELLSCHAFT",
"GESELLSCHAFT MIT BESCHRANKTER HAFTUNG", "GESELLSCHAFT MIT BESCHRANKTER HAFTUNG",
"EINGETRAGENE VEREIN", "EGYENI VALLALKOZO")), .Names = c("country",
"short.form", "long.form"), row.names = c(NA, -5L), class = c("data.table",
"data.frame"), sorted = "country")
#this is the function that I am using (please note that both data.tables are keyed; the keys currently have no effect on the output, they just avoid throwing an error):
substituting_short_form <- function(input) {
  #supply one data.frame of 1 row, the other data.frame is external to the function
  #get country from input
  setkey(input,country)
  setkey(mapping,country)
  matched_country <- input$country
  #subset of mapping to only the country from the input
  matched_map <- mapping[country == matched_country]
  #get list of short.forms from matched
  list_of_relevant_short_forms <- matched_map[,short.form]
  #which one matches will return true if there is any match, THIS IS A NUMBER THAT WILL HAVE TO BE MATCHED TO mapping again to retrieve the correct form
  #error catching for when there is no short form found, or no country found if there is no long form it does not matter!
  indextrue <- tryCatch(which(unlist(lapply(list_of_relevant_short_forms, function(y) grepl(y, input$name)))), error = function(e) return(input))
  #substitute
  pattern_to_substitute <- paste0("(\\s|^)", matched_map[indextrue,short.form], "(\\s|$)")
  pattern_to_replace <- paste0("\\1", matched_map[indextrue,long.form], "\\2")
  input$name[1] <- gsub(pattern = pattern_to_substitute, replacement = pattern_to_replace,input$name , perl = TRUE)
  return(input)
}
I would like to supply the full data.table as input and get the same kind of output (a data.table of the same length). This is my expected output:
   country                                       name   address
1:      AT                    ASDF AKTIENGESELLSCHAFT ACDSTR. 3
2:      AT ASDF GESELLSCHAFT MIT BESCHRANKTER HAFTUNG ACDSTR. 4
3:      CH                                    ASDF SA ACDSTR. 7
4:      DE                                    ASDF KG ACDSTR. 6
5:      MT                                    ASDF DF ACDSTR. 5
6:      XK                                   ASDF DAF ACDSTR. 8
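The solution I am looking for would be something along the lines of apply(tryshort3, 1, function(x) substituting_short_form(x)), possibly using the indexing capabilities of the two data.tables, or using gapply from nlme.

The problem with apply is that it coerces its argument to a matrix. Try a simple loop: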
lst <- list()
for(i in 1:nrow(tryshort3)) lst[[i]] <- substituting_short_form(tryshort3[i,])
rbindlist(lst)
#    country                                       name   address
# 1:      AT                    ASDF AKTIENGESELLSCHAFT ACDSTR. 3
# 2:      AT ASDF GESELLSCHAFT MIT BESCHRANKTER HAFTUNG ACDSTR. 4
# 3:      MT                                    ASDF DF ACDSTR. 5
# 4:      DE                                    ASDF KG ACDSTR. 6
# 5:      CH                                    ASDF SA ACDSTR. 7
# 6:      XK                                   ASDF DAF ACDSTR. 8
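
To make the coercion point concrete, here is a minimal base R sketch. The two-row data.frame df and its id column are hypothetical and only serve to show what apply() hands to the function: because the table is converted to a matrix first, each row arrives as a plain character vector, so input$country inside substituting_short_form cannot work, whereas subsetting by row index keeps a one-row table.

```r
# Hypothetical two-row table, used only to inspect what apply() passes to FUN
df <- data.frame(country = c("AT", "DE"),
                 id      = c(1L, 2L),
                 stringsAsFactors = FALSE)

# apply() converts the data.frame to a matrix first, so each row
# reaches FUN as a character vector (the integer column is converted too)
row_types <- apply(df, 1, function(x) class(x))
row_types

# Subsetting by row index keeps a one-row data.frame with its column types,
# which is what the loop over tryshort3[i, ] relies on
one_row <- df[1, ]
class(one_row$id)
```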
Maybe you could try the following steps:
# create the shortform variable in tryshort3
tryshort3[, short.form := sub(".+\\s(\\S+)$", "\\1", name)]
# add the info from mapping
tryshort3long <- merge(tryshort3, mapping, all.x=TRUE, by=c("country", "short.form"))
# replace the short form by long form in the name and suppress the variables you don't need
# (thanks to @DavidArenburg for the simplification of the "replace" part!)
tryshort3long[!is.na(long.form),
name := paste(sub(" .*", "", name), long.form)
][, c("long.form", "short.form") := NULL]
tryshort3long
#    country                                       name   address
# 1:      AT                    ASDF AKTIENGESELLSCHAFT ACDSTR. 3
# 2:      AT ASDF GESELLSCHAFT MIT BESCHRANKTER HAFTUNG ACDSTR. 4
# 3:      CH                                    ASDF SA ACDSTR. 7
# 4:      DE                                    ASDF KG ACDSTR. 6
# 5:      MT                                    ASDF DF ACDSTR. 5
# 6:      XK                                   ASDF DAF ACDSTR. 8
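
As a variation on the steps above, the following is a minimal sketch of an update join, which writes the long form into the existing table instead of building a merged copy. It is not part of the original answer. The objects dt and map are reduced, hypothetical copies of tryshort3 and mapping, and the sketch assumes the short form is always the last whitespace-separated token of name.

```r
library(data.table)

# Reduced, hypothetical copies of tryshort3 and mapping
dt <- data.table(country = c("AT", "AT", "MT", "DE"),
                 name    = c("ASDF AG", "ASDF GMBH", "ASDF DF", "ASDF KG"))
map <- data.table(country    = c("AT", "AT", "DE"),
                  short.form = c("AG", "GMBH", "GMBH"),
                  long.form  = c("AKTIENGESELLSCHAFT",
                                 "GESELLSCHAFT MIT BESCHRANKTER HAFTUNG",
                                 "GESELLSCHAFT MIT BESCHRANKTER HAFTUNG"))

# Assumption: the short form is the last token of the name
dt[, short.form := sub(".+\\s(\\S+)$", "\\1", name)]

# Update join: only rows of dt that match a (country, short.form) pair in map
# are modified; the last token is dropped and the long form appended
dt[map, on = c("country", "short.form"),
   name := paste(sub("\\s+\\S+$", "", name), i.long.form)]

# Remove the helper column again
dt[, short.form := NULL]
dt
```

With this approach the rows stay in their original order, whereas merge returns them sorted by the by columns.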
Thanks @David! :-) I had a feeling there was a way to avoid ifelse ;-)

I haven't read anything here in detail, but if you are already running a for loop, you could take a look at set...
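
The set suggestion could look roughly like the sketch below. This is an illustration under stated assumptions, not code from the thread: dt and map are reduced, hypothetical copies of tryshort3 and mapping, and the loop runs over the rows of the mapping table rather than over the data rows, reusing the pattern from the question's function.

```r
library(data.table)

# Reduced, hypothetical copies of tryshort3 and mapping
dt <- data.table(country = c("AT", "AT", "DE"),
                 name    = c("ASDF AG", "ASDF GMBH", "ASDF KG"))
map <- data.table(country    = c("AT", "AT"),
                  short.form = c("AG", "GMBH"),
                  long.form  = c("AKTIENGESELLSCHAFT",
                                 "GESELLSCHAFT MIT BESCHRANKTER HAFTUNG"))

# One pass per mapping row; set() assigns by reference without copying dt
for (k in seq_len(nrow(map))) {
  # Same word-boundary pattern as in the question's function
  pattern <- paste0("(\\s|^)", map$short.form[k], "(\\s|$)")
  rows <- which(dt$country == map$country[k] &
                  grepl(pattern, dt$name, perl = TRUE))
  if (length(rows) > 0L) {
    set(dt, i = rows, j = "name",
        value = sub(pattern, paste0("\\1", map$long.form[k], "\\2"),
                    dt$name[rows], perl = TRUE))
  }
}
dt
```

One caveat of this sketch: a long form inserted in an earlier iteration could itself contain a later short form as a separate word, so the order of the mapping rows may matter for other data.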