Regex: creating new variables based on specific values
Tags: regex, r, dplyr, stringr
I have read up on regular expressions and on Hadley Wickham's stringr and dplyr packages, but I can't figure out how to do this.
I have library circulation data in a data frame, with the call number stored as a character variable. I would like to extract the initial capital letters into one new variable, and the digits between those letters and the period into a second new variable.
Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor
How about this:
rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
'TL515 .M63 Circulating Collection, 3rd Floor'
'D753 .F4 Circulating Collection, 3rd Floor'
'DB89.F7 D4 Circulating Collection, 3rd Floor'",
stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))
# Call_Num V1 V2
# 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
# 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
# 4 D753 .F4 Circulating Collection, 3rd Floor D 753
# 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
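The gsub() approach above leaves the new columns named V1 and V2. As a rough sketch of a variant that keeps everything in base R but yields named, typed columns, the example below uses regexec() with regmatches(). The names rl2, class_letters and class_number are made up for illustration and are not part of the original answer.

```r
# Same sample call numbers as in the answer above
call_num <- c(
  "HV5822.H4 C47 Circulating Collection, 3rd Floor",
  "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
  "TL515 .M63 Circulating Collection, 3rd Floor",
  "D753 .F4 Circulating Collection, 3rd Floor",
  "DB89.F7 D4 Circulating Collection, 3rd Floor"
)
rl2 <- data.frame(Call_Num = call_num, stringsAsFactors = FALSE)

# regexec() records the full match plus one entry per capture group;
# regmatches() turns those positions into strings
parts <- regmatches(rl2$Call_Num, regexec("^([A-Z]+)([0-9]+)", rl2$Call_Num))

# Element 2 is the letter prefix, element 3 the digits that follow it
rl2$class_letters <- vapply(parts, function(p) p[2], character(1))
rl2$class_number  <- as.integer(vapply(parts, function(p) p[3], character(1)))

str(rl2)
```

Rows that do not match the pattern end up as NA in both new columns, which may or may not be what you want for malformed call numbers.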
If you want to use stringr, a solution might look like this:
# Example data, kept as character rather than factor
df <- data.frame(
  Call_Num = c(
    "HV5822.H4 C47 Circulating Collection, 3rd Floor",
    "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
    "TL515 .M63 Circulating Collection, 3rd Floor",
    "D753 .F4 Circulating Collection, 3rd Floor",
    "DB89.F7 D4 Circulating Collection, 3rd Floor"
  ),
  stringsAsFactors = FALSE
)
library(stringr)
# Column 1 of the result is the full match; columns 2 and 3 are the capture groups
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])
df2
## Call_Num letter number
## 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
## 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
## 4 D753 .F4 Circulating Collection, 3rd Floor D 753
## 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
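Since the question also mentions dplyr, the same extraction could be written inside mutate(). This is only a sketch under assumptions: the data frame name circ and the column names letter and number are invented here, and the pattern is anchored to the start of the string instead of requiring a trailing period as the answer above does.

```r
library(dplyr)
library(stringr)

# Hypothetical data frame holding the sample call numbers
circ <- data.frame(
  Call_Num = c(
    "HV5822.H4 C47 Circulating Collection, 3rd Floor",
    "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
    "TL515 .M63 Circulating Collection, 3rd Floor",
    "D753 .F4 Circulating Collection, 3rd Floor",
    "DB89.F7 D4 Circulating Collection, 3rd Floor"
  ),
  stringsAsFactors = FALSE
)

circ_new <- circ %>%
  mutate(
    # Leading run of capital letters
    letter = str_extract(Call_Num, "^[A-Z]+"),
    # Digits right after those letters: capture group 1 is column 2 of str_match()
    number = as.integer(str_match(Call_Num, "^[A-Z]+(\\d+)")[, 2])
  )

circ_new
```

Wrapping the digits in as.integer() gives an integer column directly, so no later conversion is needed.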
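Using the stringi package would be another option. Since your targets sit at the beginning of the string, stri_extract_first() works well. The pattern [:alpha:]{1,} matches a sequence of one or more letters, so with stri_extract_first() you can pick out the first run of letters. Likewise, you can find the first run of digits with stri_extract_first(x, regex = "\\d{1,}").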
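The stringi approach described above might look roughly like the sketch below. It assumes the call numbers are in a character vector x, and it swaps in the explicit classes [A-Z] and [0-9] for [:alpha:] and \\d, because the question asks specifically for capital letters. The object names alpha, num and result are illustrative only.

```r
library(stringi)

x <- c(
  "HV5822.H4 C47 Circulating Collection, 3rd Floor",
  "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
  "TL515 .M63 Circulating Collection, 3rd Floor",
  "D753 .F4 Circulating Collection, 3rd Floor",
  "DB89.F7 D4 Circulating Collection, 3rd Floor"
)

# First run of capital letters at the start of each call number
alpha <- stri_extract_first_regex(x, "^[A-Z]+")
# First run of digits anywhere in the string, converted to integer
num <- as.integer(stri_extract_first_regex(x, "[0-9]+"))

result <- data.frame(Call_Num = x, alpha = alpha, num = num,
                     stringsAsFactors = FALSE)
result
```

This relies on the first digits in the string being the ones that follow the letter prefix, which holds for the sample data but is an assumption about the rest of the call numbers.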
You can use strapply from the gsubfn package:
library(gsubfn)
m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)',
~ c(id = x, num = y), simplify = rbind)
X <- as.data.frame(m, stringsAsFactors = FALSE)
# id num
# 1 HV 5822
# 2 QE 511
# 3 TL 515
# 4 D 753
# 5 DB 89
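Comments:
I'm not sure what your data actually look like. Could you post code that generates the data frame you are working with?
Thanks jazzurro, it works great! This is the code I adapted for my particular data frame "circ_data: circ_data_new". Just one small problem: when it creates the new variables, it turns both of them into factors. Can you suggest how to make the first variable a character type and the second an integer type?
@ConceptDelta Thanks for your comment. You want to use as.character() and wrap the code in it. For example, alpha = as.character(stri_extract_first(x, regex = "[:alpha:]{1,}")). Hope this helps.
Hi Jazzurro. I tried: circ_data
@ConceptDelta You have too many parentheses. I think Call_Num_alpha = as.character(stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")) should work. Let me know if you need more help.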
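The factor issue raised in the comments can be illustrated with a small base R sketch. The vectors below are hypothetical stand-ins for the two new columns as they might appear after data.frame() has converted strings to factors (the default before R 4.0). One point the comments do not spell out: for the numeric column, going through as.character() first matters, because as.integer() applied directly to a factor returns the internal level codes rather than the printed numbers.

```r
# Hypothetical factor columns standing in for the extracted letters and digits
alpha <- factor(c("HV", "QE", "TL", "D", "DB"))
num   <- factor(c("5822", "511", "515", "753", "89"))

# Letters: factor to character
alpha_chr <- as.character(alpha)

# Digits: factor to character to integer
num_int <- as.integer(as.character(num))

# For comparison only: this gives the level codes, not the call-number digits
level_codes <- as.integer(num)

str(alpha_chr)
str(num_int)
```

Alternatively, building the data frame with stringsAsFactors = FALSE avoids the factors in the first place, leaving only the as.integer() step for the digits.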