Regex: creating new variables based on specific values


I have read about regular expressions and about Hadley Wickham's stringr and dplyr packages, but I can't figure out how to do this.

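I have library circulation data in a data frame, with the call number stored as a character variable. I would like to take the leading capital letters as one new variable, and the digits between those letters and the period as a second new variable.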

Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor 
How about this:

rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
                 'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
                 'TL515 .M63 Circulating Collection, 3rd Floor'
                 'D753 .F4 Circulating Collection, 3rd Floor'
                 'DB89.F7 D4 Circulating Collection, 3rd Floor'",
                 stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))

#                                              Call_Num V1   V2
# 1     HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE  511
# 3        TL515 .M63 Circulating Collection, 3rd Floor TL  515
# 4          D753 .F4 Circulating Collection, 3rd Floor  D  753
# 5        DB89.F7 D4 Circulating Collection, 3rd Floor DB   89
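
The gsub() call above keeps only the two capture groups and lets read.table() split them into columns named V1 and V2. As a hedged alternative sketch in base R, regexec() with regmatches() returns the capture groups directly, which makes it easy to name the columns and set their types. The column names class_letters and class_number are made up for this illustration, and the sketch assumes every call number starts with capital letters followed by digits.

```r
# Sample call numbers copied from the question
call_num <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
              "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
              "TL515 .M63 Circulating Collection, 3rd Floor",
              "D753 .F4 Circulating Collection, 3rd Floor",
              "DB89.F7 D4 Circulating Collection, 3rd Floor")

# regexec() records the full match plus one entry per capture group;
# regmatches() turns those positions into character vectors
m <- regmatches(call_num, regexec("^([A-Z]+)([0-9]+)", call_num))

# Each list element is c(full match, letters, digits); stack them into a matrix
parts <- do.call(rbind, m)

# Hypothetical column names; the digits are converted to integer explicitly
result <- data.frame(Call_Num      = call_num,
                     class_letters = parts[, 2],
                     class_number  = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
str(result)
```

If some call numbers do not match the pattern, regmatches() returns an empty vector for them and the rbind() step would silently drop those rows, so real data may need a check before combining.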

If you want to use stringr, a solution might look like this:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

library(stringr)

# Column 1 of the match matrix is the full match; columns 2 and 3 are the capture groups
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])
df2
##                                                  Call_Num letter number
## 1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
## 3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
## 4          D753 .F4 Circulating Collection, 3rd Floor      D    753
## 5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89
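
Since the question also mentions dplyr, here is a minimal sketch of the same str_match() idea inside mutate(), with the number converted to integer. It assumes the call numbers always begin with capital letters followed by digits, so the pattern is anchored at the start instead of requiring the trailing period; the names pattern and df_new are only illustrative.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Leading capital letters, then the digits that follow them
pattern <- "^([A-Z]+)(\\d+)"

df_new <- df %>%
  mutate(
    letter = str_match(Call_Num, pattern)[, 2],             # first capture group
    number = as.integer(str_match(Call_Num, pattern)[, 3])  # second group, as integer
  )

str(df_new)
```

Rows that do not match the pattern get NA in both new columns, because str_match() returns NA for missing matches.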

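Using the stringi package would be another option. Since your targets sit at the beginning of the strings, stri_extract_first() works well here. The pattern [:alpha:]{1,} matches a run of one or more alphabetic characters, so stri_extract_first() with that pattern picks out the first run of letters. Likewise, stri_extract_first(x, regex = "\\d{1,}") finds the first run of digits.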
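
The code for this answer did not survive on this page, so the following is only a sketch of what the description suggests, not the answerer's original code. It assumes x is the vector of call numbers; the names alpha, num, and out are illustrative, and the letter pattern is written as [[:alpha:]]+ (a POSIX class inside a bracket expression).

```r
library(stringi)

# x is assumed to hold the call numbers from the question
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# First run of letters in each string
alpha <- stri_extract_first(x, regex = "[[:alpha:]]+")

# First run of digits in each string, converted to integer
num <- as.integer(stri_extract_first(x, regex = "\\d+"))

# stringsAsFactors = FALSE keeps alpha as character on older R versions
out <- data.frame(Call_Num = x, alpha = alpha, num = num,
                  stringsAsFactors = FALSE)
str(out)
```

This relies on the letters and digits of interest being the first such runs in each string, which holds for the sample data but may not hold for every call number.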


You can use strapply from the gsubfn package:

library(gsubfn)

m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)', 
     ~ c(id = x, num = y), simplify = rbind)

X <- as.data.frame(m, stringsAsFactors = FALSE)

#   id  num
# 1 HV 5822
# 2 QE  511
# 3 TL  515
# 4  D  753
# 5 DB   89

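Comments:

I'm not sure what your data actually looks like. Could you post code that generates the data frame you are working with?

Thanks jazzurro, it works great! Here is the code I modified for my particular data frame "circ_data: circ_data_new". Just one small problem: when it creates the new variables, it turns both of them into factors. Can you suggest how to make the first variable character and the second one integer?

@ConceptDelta Thanks for your comment. You want to use as.character() and wrap the code in it. For example, alpha = as.character(stri_extract_first(x, regex = "[:alpha:]{1,}")). Hope this helps.

Hi Jazzurro. I tried: circ_data

@ConceptDelta You have too many parentheses. I think Call_Num_alpha = as.character(stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")) should work. Let me know if you need more help.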
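
For the factor question raised in the comments, a small sketch of the conversion may help. The data frame below is a stand-in built for illustration (the commenter's actual circ_data_new code is not shown on this page), and the column names alpha and num are assumptions.

```r
# Stand-in data: two columns that ended up as factors (hypothetical names)
circ_data_new <- data.frame(
  alpha = factor(c("HV", "QE", "TL", "D", "DB")),
  num   = factor(c("5822", "511", "515", "753", "89"))
)

# Factor to character is a direct conversion
circ_data_new$alpha <- as.character(circ_data_new$alpha)

# Factor to integer must go through character first:
# as.integer() applied directly to a factor returns the internal level codes
circ_data_new$num <- as.integer(as.character(circ_data_new$num))

str(circ_data_new)
```

Passing stringsAsFactors = FALSE to data.frame() when the columns are first created avoids the factor conversion altogether on R versions before 4.0, where TRUE was the default.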