R: Extracting patterns from text
I want to extract a somewhat complex pattern. From the Text column I want to extract alphanumeric strings that are at least 5 and at most 9 characters long, and print them in a new column. If a row contains several such strings, I want them in comma-separated format. All of the files are in comma-separated format. The pattern starts with a letter or a digit, but I do not want patterns that start with D or DF.
df <- data.frame(Text = c(
  "in which some columns are 1A265T up for some rows.",
  "It's too large to 12345AB MB eyeball in order to identify D12345AB",
  "some data to the axis A6651F correct columns for these rows",
  "Any output that would allow me to identify that AJ_DF125AA12.",
  "how do I find some locations 564789."
))
Desired output is:
  Text                                                                Pattern
1 in which some columns are 1A265T , SDFG123 up for some rows.        1A265T , SDFG123
2 It's too large to 12345AB MB eyeball in order to identify P12345AB  12345AB
3 some data to the axis A6651F correct columns for these rows         A6651F
4 Any output that would allow me to identify that AJ_DF125AA12.       NA
5 how do I find some locations 564789.                                564789
I have tried the str_detect function:
df %>%
  filter(str_detect(Text, ".+[A-Z0-9,]+"))
Does anybody know the correct way to do this?
In base R:
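# Candidate tokens: alphanumeric runs with at least one digit that is neither the first nor the last character
AllNumbers <- regmatches(df$Text, gregexpr("[A-Za-z0-9]+\\d+[A-Za-z0-9]+", df$Text))
# Blank out tokens that start with D (this also covers DF)
AllNumbers <- lapply(AllNumbers, function(x) gsub("^D[A-Za-z0-9]+", "", x))
AllLengths <- lapply(AllNumbers, nchar)
# Keep only tokens that are 5 to 9 characters long
df$Pattern <- lapply(seq_along(AllNumbers), function(i) AllNumbers[[i]][AllLengths[[i]] >= 5 & AllLengths[[i]] <= 9])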
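The code above leaves df$Pattern as a list of character vectors. The question asks for one comma-separated string per row, with NA when nothing matches, so here is a minimal sketch of that last step. It is a variation rather than the answer's exact method: it assumes the codes are uppercase only and matches whole words instead of requiring a digit in the middle. The helper name extract_codes is hypothetical.

```r
# Hypothetical helper: returns one comma-separated string per input element,
# or NA when no code is found.
extract_codes <- function(text) {
  # Whole words made of 5 to 9 uppercase letters or digits
  hits <- regmatches(text, gregexpr("\\b[A-Z0-9]{5,9}\\b", text, perl = TRUE))
  vapply(hits, function(x) {
    x <- x[!startsWith(x, "D")]  # drop codes starting with D (covers DF too)
    if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
  }, character(1))
}

df <- data.frame(Text = c(
  "in which some columns are 1A265T , SDFG123 up for some rows.",
  "It's too large to 12345AB MB eyeball in order to identify D12345AB",
  "Any output that would allow me to identify that AJ_DF125AA12."
), stringsAsFactors = FALSE)

df$Pattern <- extract_codes(df$Text)
print(df$Pattern)
```

Because the result is a plain character vector, the data frame can be written out with write.csv without further conversion.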
You can use:
library(stringr)

df <- data.frame(Text = c(
  "in which some columns are 1A265T , SDFG123 up for some rows.",
  "It's too large to 12345AB MB eyeball in order to identify D12345AB",
  "some data to the axis A6651F correct columns for these rows",
  "Any output that would allow me to identify that AJ_DF125AA12.",
  "how do I find some locations 564789."
))
# Extract every match per row and join multiple matches with commas
df$Pattern <- sapply(str_extract_all(df$Text, "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"), paste, collapse = ",")
# Rows without a match yield an empty string; turn those into NA
df$Pattern[df$Pattern == ""] <- NA
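The regex matches:
- \b - a word boundary
- [A-CE-Z0-9] - an ASCII digit or an uppercase letter other than D
- [A-Z0-9]{4,8} - four to eight ASCII digits or uppercase letters
- \b - a word boundary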
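To see how each part of the pattern behaves, it can be tried on individual tokens. This is only an illustrative sketch: it uses base R's grepl with perl = TRUE instead of stringr, and the token A123456789 is made up to show the upper length limit.

```r
pattern <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"

tokens <- c(
  "1A265T",        # 6 characters, starts with a digit
  "SDFG123",       # 7 characters, starts with S
  "D12345AB",      # starts with D, so the first character class rejects it
  "AJ_DF125AA12",  # "_" is a word character, so there is no boundary before DF
  "MB",            # shorter than 5 characters
  "564789",        # digits only
  "A123456789"     # 10 characters, one more than the maximum (made-up token)
)

matched <- grepl(pattern, tokens, perl = TRUE)
names(matched) <- tokens
print(matched)
```

Only 1A265T, SDFG123 and 564789 come back TRUE.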
An equivalent pattern uses a negative lookahead:
\b(?!D)[A-Z0-9]{5,9}\b
Here, (?!D) requires that the next character is not D.
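One thing to keep in mind: the (?!D) lookahead works in stringr, but base R functions such as gregexpr need perl = TRUE for it, because the default regex engine does not support lookaheads. The sketch below uses made-up input strings to compare the lookahead pattern with the character-class pattern in base R.

```r
text <- c(
  "identify D12345AB and 12345AB",
  "AJ_DF125AA12.",
  "locations 564789."
)

lookahead  <- "\\b(?!D)[A-Z0-9]{5,9}\\b"
char_class <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"

# perl = TRUE is required for the lookahead pattern in base R
by_lookahead <- regmatches(text, gregexpr(lookahead, text, perl = TRUE))
by_class     <- regmatches(text, gregexpr(char_class, text, perl = TRUE))

print(by_lookahead)
print(identical(by_lookahead, by_class))
```

Both patterns return the same matches on these strings, so the choice between them is mostly a matter of readability.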
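Your df and your expected output do not match. For example, df contains "eyeball in order to identify D12345AB", but your expected output contains "to identify P12345AB".
Yes, I will correct these. Any idea how to restrict the pattern so that it does not start with D?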
  Text                                                                Pattern
1 in which some columns are 1A265T , SDFG123 up for some rows.        1A265T,SDFG123
2 It's too large to 12345AB MB eyeball in order to identify D12345AB  12345AB
3 some data to the axis A6651F correct columns for these rows         A6651F
4 Any output that would allow me to identify that AJ_DF125AA12.       NA
5 how do I find some locations 564789.                                564789