
R: Splitting strings of variable length in a data.table


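I want to create a column based on part of a string in another column.

The general format of the reference column is: GB / Ling 31st Dec

In this example I want to extract the word "Ling", which varies in length.

My approach so far has been: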

library(data.table)
d1 <- data.table(MENU_HINT = 
                 c("GB / Ling 31st Dec", "GB / Taun 30th Dec", 
                   "GB / Ayr 19th Dec", "GB / Ayr 9th Nov", 
                   "GB / ChelmC 29th Sep"), 
             Track = c("Ling", "Taun", "Ayr", "Ayr", "ChelmC"))

#remove all the spaces
d1[, Track2 := gsub("[[:space:]]", "", MENU_HINT)]

# get the position of the first digit
d1[, x := as.numeric(regexpr("[[:digit:]]", Track2)[[1]])]

# get the position of the '/'
d1[, y := as.numeric(regexpr("/", Track2))[[1]]]

# use above to extract the Track
d1[, Track2 := substr(Track2, y + 1, x - 1)]
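
A note on the attempt above: `regexpr()` is vectorised, so taking `[[1]]` of its result keeps only the first row's position and recycles it to every row. With this sample data that would give "Ayr1", "Ayr9" and "Chel" for the last three rows. A minimal sketch of the same idea with per-row positions follows; the temporary column name `tmp` is chosen only for this illustration and does not appear in the question.

```r
library(data.table)

d1 <- data.table(MENU_HINT = c("GB / Ling 31st Dec", "GB / Taun 30th Dec",
                               "GB / Ayr 19th Dec", "GB / Ayr 9th Nov",
                               "GB / ChelmC 29th Sep"))

# strip all whitespace into a temporary column (tmp is a hypothetical name)
d1[, tmp := gsub("[[:space:]]", "", MENU_HINT)]

# keep the whole vector returned by regexpr(), one position per row
d1[, x := as.integer(regexpr("[[:digit:]]", tmp))]      # first digit
d1[, y := as.integer(regexpr("/", tmp, fixed = TRUE))]  # the slash

# substr() is vectorised over start and stop, so each row uses its own bounds
d1[, Track2 := substr(tmp, y + 1L, x - 1L)]

# drop the helper columns
d1[, c("tmp", "x", "y") := NULL]
```

This sketch assumes the track name itself never contains a digit, as in the sample data.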
We can use sub:

d1[, Track2 := sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", MENU_HINT)]

Or use gsub:

d1[, Track2 := gsub("^[^/]+/\\s*|\\s+.*$", "", MENU_HINT)]
d1
#              MENU_HINT  Track Track2
#1:   GB / Ling 31st Dec   Ling   Ling
#2:   GB / Taun 30th Dec   Taun   Taun
#3:    GB / Ayr 19th Dec    Ayr    Ayr
#4:     GB / Ayr 9th Nov    Ayr    Ayr
#5: GB / ChelmC 29th Sep ChelmC ChelmC

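I wouldn't use regex for this, as it is inefficient for large data sets. It looks like the word you are after always comes after the second space. A very simple and efficient solution could be: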

d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]] 
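
To make the `[[3]]` less opaque, here is a small sketch of what `tstrsplit()` returns for strings of this shape. The vector `hints` is a made-up subset of the question's data, and the `keep` argument is shown only as an alternative way to select the third field, not as part of the answer above.

```r
library(data.table)

# hypothetical subset of the question's MENU_HINT values
hints <- c("GB / Ling 31st Dec", "GB / ChelmC 29th Sep")

# tstrsplit() splits each string and transposes the result:
# one list element per space-separated field
parts <- tstrsplit(hints, " ", fixed = TRUE)
length(parts)   # number of fields: "GB", "/", track, day, month
parts[[3]]      # the third field holds the track name

# keep = 3L asks tstrsplit() to return only the third field (a list of length 1)
third <- tstrsplit(hints, " ", fixed = TRUE, keep = 3L)
third[[1]]
```

Both forms rely on the track name containing no spaces of its own.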
Benchmark

bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
                               "gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
                               "tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
#        expr       min        lq      mean    median        uq      max neval
#       sub:   982.1185  998.6264 1058.1576 1025.8775 1083.1613 1405.051   100
#      gsub:  1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027   100
# tstrsplit:   385.4785  452.6476  498.8681  470.8281  537.5499 1044.691   100

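Please show a small reproducible example and the expected output.

Take a look at the str_extract function in the stringr package.

@akrun Sorry, I have now added a small example.

I wouldn't use regex for this; it is inefficient for large data sets. It looks like the word you are after always comes after the second space. A very simple and efficient solution could be d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]]

@DavidArenburg Thanks David, your answer was actually 2.5 times faster on my 700k-row data.

Thanks for the quick reply. On my data, sub seems to be slightly faster of the two.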
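
The comments mention str_extract from the stringr package without showing a call. A minimal sketch of how that suggestion might look is below; the pattern and the `hints` vector are assumptions made for illustration and were not posted in the thread.

```r
library(stringr)

# hypothetical subset of the question's MENU_HINT values
hints <- c("GB / Ling 31st Dec", "GB / Ayr 9th Nov", "GB / ChelmC 29th Sep")

# (?<=/ ) is a lookbehind for "slash + space"; \\S+ then takes the
# run of non-space characters that follows, i.e. the track name
track <- str_extract(hints, "(?<=/ )\\S+")
track
```

As this is still a regex approach, the efficiency remarks made in the thread about regex on large data may apply to it as well.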