Regex: How do I add a column to a data.table in R based on a string in another column?
I want to add columns to a data.table based on a string in another column. This is my data and what I have tried:

   Params
1: { clientID : 459; time : 1386868908703; version : 6}
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}
3: { clientID : 988; time : 1388939739771}
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}
5: { clientID : 459; time : 1388090530634}

I would like to parse the text in the "Params" column and create new columns based on it. For example, I would like a new column named "user" that holds only the number after "user :" in the Params string. The added column should look like this:

   Params                                                                                 user
1: { clientID : 459; time : 1386868908703; version : 6}                                   NA
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}  459001
3: { clientID : 988; time : 1388939739771}                                                NA
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}  459001
5: { clientID : 459; time : 1388090530634}                                                NA
How can I solve this? Thanks.

Here is one way to do this with a regular expression:
myparse <- function(searchterm, s) {
  res <- rep(NA_character_, length(s))  # start with an all-NA result vector
  idx <- grepl(searchterm, s)  # rows whose string contains the search term
  pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*")  # capture the value after "term : "
  res[idx] <- sub(pattern, "\\1", s[idx])  # keep only the captured value
  return(res)
}

DT[, user := myparse("user", Params)]
For rows without a user field, the new column contains NA:
DT[, user]
# [1] NA "459001" NA "459001" NA
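The same helper can fill several columns in one call. The following is a minimal sketch, assuming the data.table package is installed and that DT holds the five example rows from the question; the choice of fields is only for illustration.

```r
library(data.table)

# myparse() as defined in the answer above
myparse <- function(searchterm, s) {
  res <- rep(NA_character_, length(s))
  idx <- grepl(searchterm, s)
  pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*")
  res[idx] <- sub(pattern, "\\1", s[idx])
  res
}

# The example rows from the question
DT <- data.table(Params = c(
  "{ clientID : 459; time : 1386868908703; version : 6}",
  "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
  "{ clientID : 988; time : 1388939739771}",
  "{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}",
  "{ clientID : 459; time : 1388090530634}"
))

fields <- c("user", "time", "version")  # illustrative choice of keys

# One new character column per key; rows lacking a key get NA
DT[, (fields) := lapply(fields, myparse, s = Params)]
DT[, .(user, version)]
```

All new columns are character vectors. If numbers are needed, as.numeric() is probably the safer conversion for the time values, since they do not fit into R's 32-bit integer type.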
I would use an external parser instead, for example:
library(yaml)
DT <- data.frame(
  Params = c(
    "{ clientID : 459; time : 1386868908703; version : 6}",
    "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
    "{ clientID : 988; time : 1388939739771}",
    "{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}",
    "{ clientID : 459; time : 1388090530634}"
  ),
  stringsAsFactors = FALSE
)
# Strip the surrounding "{ " and "}" and put each "key : value" pair on its own line
conv.to.yaml <- function(x) {
  gsub('; ', '\n', substr(x, 3, nchar(x) - 1))
}
tmp <- lapply(DT$Params, function(x) yaml.load(conv.to.yaml(x)))
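Thanks. This works well for the data I provided. How would I need to adjust the regex to allow for the string "{clientID:461;time:1386770861254;type:new;newUser:461002}", which includes "type:new"?
@Miriam What should the result be for this example, "type:new" or "new"?
The column should be named "type" and the value should be "new" (as for user).
@Miriam Try DT[, type := myparse("type", Params)].
For some reason this does not work when I apply your myparse function to my string t:
> myparse("type", t)
[1] "{clientID:461;time:13866770861254;type:new;newUser:461002}"
The whole string is returned, in contrast to myparse("time", t). Any idea what the reason is?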
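The thread does not say what caused this. One general way the symptom can arise is that grepl() finds the search term while the stricter sub() pattern fails to match (for instance when the spacing around the colon differs), in which case sub() returns its input unchanged. The sketch below is a hedged variant under the hypothetical name myparse2 (not part of the original answer) that tolerates any spacing around the colon and returns NA when nothing matches.

```r
# Hypothetical variant of myparse(); the name myparse2 is illustrative only.
myparse2 <- function(searchterm, s) {
  # \\b anchors the key at a word boundary; \\s* allows any spacing around ":"
  pattern <- paste0("\\b", searchterm, "\\s*:\\s*([^;}]+)")
  m <- regmatches(s, regexec(pattern, s))
  # Each element of m is character(0) (no match) or c(full match, captured value)
  vapply(
    m,
    function(x) if (length(x) >= 2) trimws(x[2]) else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

x <- c(
  "{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}",
  "{clientID:461;type:new}",
  "{ clientID : 988; time : 1388939739771}"
)
myparse2("type", x)
# [1] "new" "new" NA
```

Because matching and extraction use the same pattern here, a row can no longer be flagged as a hit and then pass through unmodified.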
Error in data.table(list(Params = c("{ clientID : 459; time : 1386868908703; version : 6}", :
argument 2 (nrow 2) cannot be recycled without remainder to match longest nrow (5)
# Continuing the yaml-based answer: collect every field name, then bind the parsed lists row by row
unames <- unique(unlist(sapply(tmp, names)))
res <- as.data.frame(do.call(rbind, lapply(tmp, function(x) x[unames])))
colnames(res) <- unames
res
#   clientID       time version                       id   user
# 1      459 -405527905       6                     NULL   NULL
# 2      459 -405612269    NULL 52a9ea8b534b2b0b5000575f 459001
# 3      988 1665303163    NULL                     NULL   NULL
# 4      459 -405626089    NULL 52a9ec00b73cbf0b210057e9 459001
# 5      459  816094026    NULL                     NULL   NULL
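The time values in this output do not match the input. For example, 1386868908703 - 323 * 2^32 = -405527905, which suggests the numbers wrapped around as 32-bit integers during parsing. As an alternative, here is a base-R sketch that avoids the parser and keeps every value as a string. The helper name parse_params is hypothetical, and the sketch assumes each entry has the form "key : value" with entries separated by ";".

```r
# Hypothetical helper; parse_params is not part of the original answers.
parse_params <- function(s) {
  # Drop the surrounding braces, then split into "key : value" entries
  inner <- gsub("^\\s*\\{\\s*|\\s*\\}\\s*$", "", s)
  entries <- strsplit(inner, "\\s*;\\s*")
  rows <- lapply(entries, function(p) {
    # Split each entry at its first ":" only
    kv <- regmatches(p, regexpr(":", p), invert = TRUE)
    vals <- vapply(kv, function(x) trimws(x[2]), character(1))
    names(vals) <- vapply(kv, function(x) trimws(x[1]), character(1))
    vals
  })
  keys <- unique(unlist(lapply(rows, names)))
  # Index every row by the full key set; keys a row lacks become NA
  out <- do.call(rbind, lapply(rows, function(r) unname(r[keys])))
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  names(out) <- keys
  out
}

Params <- c(
  "{ clientID : 459; time : 1386868908703; version : 6}",
  "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
  "{ clientID : 988; time : 1388939739771}",
  "{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}",
  "{ clientID : 459; time : 1388090530634}"
)
res <- parse_params(Params)
res$time
```

Keeping the values as character leaves the choice of type to the caller; as.numeric(res$time) stores these 13-digit values exactly as doubles.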