R: Extracting a pattern from text


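I want to extract a somewhat complicated pattern.

From the Text column I want to extract alphanumeric strings that are at least 5 and at most 9 characters long, and put them in a new column. If a row contains more than one such string, they should be returned in comma-separated form. All of the files are in comma-separated format.

The pattern may start with a letter or a digit, but it must not start with D or DF.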

df = data.frame(Text=c(("in which some columns are 1A265T up for some rows."),
                    ("It's too large to 12345AB MB eyeball in order to identify D12345AB"),
                    ("some data to the axis A6651F correct columns for these rows"),
                    ("Any output that would allow me to identify that AJ_DF125AA12."),
                    ("how do I find some locations 564789.")))

Desired output is:

  Text                                                                Pattern
1 in which some columns are 1A265T , SDFG123 up for some rows.        1A265T , SDFG123
2 It's too large to 12345AB MB eyeball in order to identify P12345AB  12345AB
3 some data to the axis A6651F correct columns for these rows         A6651F
4 Any output that would allow me to identify that AJ_DF125AA12.       NA
5 how do I find some locations 564789.                                564789

I have tried the str_detect function:

df %>% 
  filter(str_detect(text, ".+[A-Z0-9,]+"))

Does anybody know the correct way?

In base R:

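# Extract every token that has at least one digit with a character on each side
AllNumbers <- regmatches(df$Text, gregexpr("[A-z0-9]+\\d+[A-z0-9]+", df$Text))
# Blank out tokens that start with D
AllNumbers <- lapply(AllNumbers, function(x) gsub("^D[A-z0-9]+", "", x))
# Length of each remaining token (lapply keeps one list element per row)
AllLengths <- lapply(AllNumbers, nchar)

# Keep only the tokens that are 5 to 9 characters long
df$Pattern <- lapply(seq_along(AllNumbers), function(x) AllNumbers[[x]][AllLengths[[x]] >= 5 & AllLengths[[x]] <= 9])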
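
For comparison, here is a minimal base R sketch of the same extract-then-filter idea that also collapses the matches into a comma-separated string and returns NA when nothing qualifies. The helper name keep_codes is hypothetical, and the sample data assumes the version of the first row that contains SDFG123, as in the desired output.

```r
df <- data.frame(Text = c("in which some columns are 1A265T , SDFG123 up for some rows.",
                          "It's too large to 12345AB MB eyeball in order to identify D12345AB",
                          "some data to the axis A6651F correct columns for these rows",
                          "Any output that would allow me to identify that AJ_DF125AA12.",
                          "how do I find some locations 564789."),
                 stringsAsFactors = FALSE)

# Step 1: split each text into whole "words" (letters, digits, underscore)
tokens <- regmatches(df$Text, gregexpr("\\w+", df$Text, perl = TRUE))

# Step 2 (hypothetical helper): keep words made only of uppercase letters and
# digits, 5 to 9 characters long, that do not start with D
keep_codes <- function(x) {
  x[grepl("^[A-Z0-9]{5,9}$", x, perl = TRUE) & !startsWith(x, "D")]
}

# Step 3: collapse the survivors with commas, or use NA when there are none
df$Pattern <- vapply(tokens, function(x) {
  codes <- keep_codes(x)
  if (length(codes) == 0) NA_character_ else paste(codes, collapse = ",")
}, character(1))

print(df$Pattern)
```

Because vapply returns a plain character vector, Pattern ends up as an ordinary character column rather than a list column.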
You can use:

df = data.frame(Text=c(("in which some columns are 1A265T , SDFG123 up for some rows."),
                     ("It's too large to 12345AB MB eyeball in order to identify D12345AB"),
                     ("some data to the axis A6651F correct columns for these rows"),
                     ("Any output that would allow me to identify that AJ_DF125AA12."),
                     ("how do I find some locations 564789.")))

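library(stringr)

# Extract all matches per row and join them with commas
df$Pattern <- lapply(str_extract_all(df$Text, "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"), paste, collapse=",")
# Rows without a match yield an empty string; turn those into NA
df[df==''] <- NA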
The regex matches:

  • \b - a word boundary
  • [A-CE-Z0-9] - an ASCII digit or uppercase letter other than D
  • [A-Z0-9]{4,8} - four to eight ASCII digits or uppercase letters
  • \b - a word boundary
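
To see how the pieces above interact, here is a small base R sketch that runs the same pattern on a few short strings. The helper extract_codes and the test strings are made up for illustration, and perl = TRUE is used so that the character ranges are interpreted as plain ASCII ranges.

```r
# Explicit-class pattern from the answer
pattern <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"

# Hypothetical helper: return all matches found in a single string
extract_codes <- function(x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# The D-prefixed token is skipped: D is excluded from the first class, and
# there is no word boundary between "D" and "1" to restart the match
print(extract_codes("identify 12345AB and D12345AB"))

# "_" counts as a word character, so there is no boundary inside AJ_DF125AA12
print(extract_codes("that AJ_DF125AA12."))

# ABCD is too short (4 characters) and ABCDEFGHIJ is too long (10 characters)
print(extract_codes("codes ABCD and ABCDEFGHIJ"))
```

Note that the pattern does not require a digit, so an all-uppercase word of 5 to 9 letters would also match.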

Note: you can "simplify" the pattern with a negative lookahead:

\b(?!D)[A-Z0-9]{5,9}\b

Here, (?!D) requires that the next character is not D.
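
As a quick check, the sketch below compares the lookahead version with the explicit character-class version on three of the sample sentences. It uses base R rather than stringr; in base R the lookahead only works with perl = TRUE, since the default regex engine does not support (?!...). The variable names are made up for illustration.

```r
explicit  <- "\\b[A-CE-Z0-9][A-Z0-9]{4,8}\\b"
lookahead <- "\\b(?!D)[A-Z0-9]{5,9}\\b"

texts <- c("in which some columns are 1A265T , SDFG123 up for some rows.",
           "It's too large to 12345AB MB eyeball in order to identify D12345AB",
           "how do I find some locations 564789.")

# perl = TRUE selects the PCRE engine, which is needed for the lookahead
m_explicit  <- regmatches(texts, gregexpr(explicit,  texts, perl = TRUE))
m_lookahead <- regmatches(texts, gregexpr(lookahead, texts, perl = TRUE))

# Both patterns should pick out the same tokens on these inputs
print(identical(m_explicit, m_lookahead))
print(m_lookahead)
```

Since DF also begins with D, the single (?!D) check covers both of the prefixes mentioned in the question.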

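Your df and your expected output do not match. For example, df contains "eyeball in order to identify D12345AB", but your expected output contains "to identify P12345AB".

Yes, I will correct these. Any ideas on restricting the pattern so that it does not start with D?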
                                                            Text        Pattern
1       in which some columns are 1A265T , SDFG123 up for some rows. 1A265T,SDFG123
2 It's too large to 12345AB MB eyeball in order to identify D12345AB        12345AB
3        some data to the axis A6651F correct columns for these rows         A6651F
4      Any output that would allow me to identify that AJ_DF125AA12.             NA
5                               how do I find some locations 564789.         564789
\b(?!D)[A-Z0-9]{5,9}\b