How to add columns to a data.table in R based on a string in another column?


I would like to add columns to a data.table based on a string in another column. Here is my data and what I have tried:

   Params
1: { clientID : 459; time : 1386868908703; version : 6}
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}
3: { clientID : 988; time : 1388939739771}
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}
5: { clientID : 459; time : 1388090530634}

I would like to parse the text in the "Params" column and create new columns based on it. For example, I would like a new column named "user" that holds only the number following "user : " in the Params string, and NA where the string has no such field. The added column should look like this:

   Params                                                                                 user
1: { clientID : 459; time : 1386868908703; version : 6}                                   NA
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}  459001
3: { clientID : 988; time : 1388939739771}                                                NA
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}  459001
5: { clientID : 459; time : 1388090530634}                                                NA

How can I solve this? Thanks.

Here is one way to do this with a regular expression:

myparse <- function(searchterm, s) {
  res <- rep(NA_character_, length(s)) # NA vector
  idx <- grepl(searchterm, s) # index for strings including the search term
  pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*") # regex pattern
  res[idx] <- sub(pattern, "\\1", s[idx]) # extract target string
  return(res)
}

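The new column is added by reference with `:=`:

DT[, user := myparse("user", Params)]

For rows without a `user` field, the new column contains `NA`: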

DT[, user]
# [1] NA       "459001" NA       "459001" NA

I would use an external parser instead, for example:

library(yaml)

# Sample data; note the two spaces after each semicolon
DT <- data.frame(
    Params = c(
        "{ clientID : 459;  time : 1386868908703;  version : 6}",
        "{ clientID : 459;  id : 52a9ea8b534b2b0b5000575f;  time : 1386868824339;  user : 459001}",
        "{ clientID : 988;  time : 1388939739771}",
        "{ clientID : 459;  id : 52a9ec00b73cbf0b210057e9;  time : 1386868810519;  user : 459001}",
        "{ clientID : 459;  time : 1388090530634}"
    ),
    stringsAsFactors = FALSE
)

# Strip the braces and put each "key : value" pair on its own line (YAML text)
conv.to.yaml <- function(x) {
    gsub(';  ', '\n', substr(x, 3, nchar(x) - 1))
}

# Parse each row into a named list
tmp <- lapply(DT$Params, function(x) yaml.load(conv.to.yaml(x)))
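Note that conv.to.yaml() replaces the literal ";  " (a semicolon followed by two spaces) and cuts the braces by fixed position. That fits the strings in this answer's DT, but input with different spacing would not be split into lines. A whitespace-tolerant sketch in base R follows; the name conv_to_yaml_ws is hypothetical and not part of the original answer:

```r
# Hypothetical variant of conv.to.yaml() that does not depend on exact spacing.
conv_to_yaml_ws <- function(x) {
  # Drop the surrounding braces together with any adjacent whitespace
  body <- gsub("^\\s*\\{\\s*|\\s*\\}\\s*$", "", x, perl = TRUE)
  # Replace each semicolon, plus any whitespace around it, with a newline
  gsub("\\s*;\\s*", "\n", body, perl = TRUE)
}

one_space  <- conv_to_yaml_ws("{ clientID : 459; time : 1386868908703; version : 6}")
two_spaces <- conv_to_yaml_ws("{ clientID : 459;  time : 1386868908703;  version : 6}")
cat(one_space, "\n")
```

Both inputs produce the same three-line text, which can then be passed to yaml.load() as in the answer.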

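Thanks. This works well for the data I provided. How do I need to adjust the regex to allow the string "{clientID:461;time:1386770861254;type:new;newUser:461002}", which includes "type:new"?

@Miriam What should the result be for this example, "type:new" or "new"?

The column should be named "type" and the value should be "new" (as with user).

@Miriam Try DT[, type := myparse("type", Params)].

For some reason this does not work when I use the myparse function on my string t:

> myparse("type", t)
[1] "{clientID:461;time:13866770861254;type:new;newUser:461002}"

The whole string is returned, unlike with myparse("time", t). Any idea what the reason might be?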
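The thread does not settle this question. One thing that is certain is that sub() returns its input unchanged when the pattern does not match, so the behaviour described is what you would see if grepl() finds the search term but the text around the colon is not exactly " : ". Whether that was the actual cause here is an assumption. A minimal sketch of a more tolerant variant, under the hypothetical name myparse_ws, allows any whitespace around the colon and uses the same pattern for the test and the extraction:

```r
# Hypothetical variant of myparse(); not from the original answer.
myparse_ws <- function(searchterm, s) {
  # \\b stops the key from matching inside a longer word;
  # \\s* allows any amount of whitespace (including none) around the colon.
  pattern <- paste0(".*\\b", searchterm, "\\s*:\\s*([^;}]+?)\\s*[;}].*")
  res <- rep(NA_character_, length(s))
  idx <- grepl(pattern, s, perl = TRUE)   # only rows where extraction will succeed
  res[idx] <- sub(pattern, "\\1", s[idx], perl = TRUE)
  res
}

x <- c("{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}",
       "{clientID:461;time:1386770861254;type:new;newUser:461002}",
       "{ clientID : 988; time : 1388939739771}")

type_col <- myparse_ws("type", x)
user_col <- myparse_ws("user", x)
```

Because idx is computed from the full pattern, a row either yields the captured value or stays NA; it can no longer come back as the whole input string.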

Error in data.table(list(Params = c("{ clientID : 459;  time : 1386868908703;  version : 6}",  : 
  argument 2 (nrow 2) cannot be recycled without remainder to match longest nrow (5)


Continuing the yaml approach, the parsed lists in tmp are combined into a single data frame with one column per key:

unames <- unique( unlist(sapply( tmp, names) ) )
res <- as.data.frame(  do.call(rbind, lapply(tmp, function(x)x[unames]) ) )
colnames( res ) <- unames
res

> res
  clientID       time version                       id   user
1      459 -405527905       6                     NULL   NULL
2      459 -405612269    NULL 52a9ea8b534b2b0b5000575f 459001
3      988 1665303163    NULL                     NULL   NULL
4      459 -405626089    NULL 52a9ec00b73cbf0b210057e9 459001
5      459  816094026    NULL                     NULL   NULL
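The time values in this output do not match the input: 1386868908703 is shown as -405527905, which is exactly that number wrapped around a 32-bit signed integer. As an alternative, here is a sketch in base R that keeps every value as a string and converts time to a double afterwards. The helper name parse_params is hypothetical, and the approach assumes that no key or value contains a semicolon:

```r
# Hypothetical helper: "{ k : v; k : v }" -> named character vector
parse_params <- function(x) {
  body  <- gsub("^\\s*\\{\\s*|\\s*\\}\\s*$", "", x, perl = TRUE)  # drop the braces
  pairs <- strsplit(body, "\\s*;\\s*", perl = TRUE)[[1]]          # one "key : value" each
  # Split every pair at its first colon only
  kv   <- regmatches(pairs, regexpr(":", pairs), invert = TRUE)
  keys <- trimws(vapply(kv, `[`, character(1), 1))
  vals <- trimws(vapply(kv, `[`, character(1), 2))
  setNames(vals, keys)
}

params <- c(
  "{ clientID : 459; time : 1386868908703; version : 6}",
  "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}",
  "{ clientID : 988; time : 1388939739771}",
  "{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}",
  "{ clientID : 459; time : 1388090530634}"
)

parsed <- lapply(params, parse_params)
keys   <- unique(unlist(lapply(parsed, names)))

# Keys missing from a row become NA; every column starts out as character
res <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[keys]))),
  stringsAsFactors = FALSE
)
names(res) <- keys

# Doubles represent these 13-digit timestamps exactly; 32-bit integers cannot
res$time <- as.numeric(res$time)
```

Columns such as id stay character, which also avoids any attempt to read the hexadecimal identifiers as numbers.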