Regex capturing everything when the capture group doesn't exist
I'm trying to use regex capture groups to extract some specific text from a column. One thing I've noticed is that if the capture group doesn't exist, it captures everything.
Here is the code I used to create a new column with the parsed text (data.table syntax in R):
The results are as follows:
55 g <- fine
45 Gallon <- fine
Amazing Hexagonal Fish Tank. <- not good. how to replace with NA?
92gallon <- fine
30 o <- wrongly identified
29 <- wrongly identified
10 gallon <- I thought [0-9]{2,3} would grab 2 or 3 digits?
10 gallon <- only 1 of 2 tank sizes identified
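The annotated results above can be reproduced with a sub() call of the following shape. The pattern is an assumption inferred from the listing and from the remark further down that every letter is optional; it is not the asker's actual code. The sketch shows two things: sub() returns its input unchanged when nothing matches (hence the whole title coming back), and a greedy leading .* swallows the first digit of "110".

```r
# A few of the sample titles from this thread
x <- c("Nice 55 g fish tank with stand",
       "Amazing Hexagonal Fish Tank.",
       "Fish Tank & Stand $30 obo",
       "110 gallon tall fish tank")

# ASSUMED pattern (not the original code): greedy prefix, 2-3 digits,
# an optional space, then every letter of "gallon" individually optional
pat <- ".*([0-9]{2,3} ?g?a?l?l?o?n?).*"

res <- sub(pat, "\\1", x, ignore.case = TRUE, perl = TRUE)
print(res)

# sub() leaves non-matching strings untouched, so check for a match
# explicitly and fill in NA where there is none
res_na <- ifelse(grepl(pat, x, ignore.case = TRUE, perl = TRUE),
                 res, NA_character_)
print(res_na)
```

Because the leading .* is greedy, it gives back as little as it can: for "110 gallon" it keeps the first "1", and [0-9]{2,3} is left with only "10". Since every letter after the digits is optional, "30 o" also counts as a match.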
I'm not sure exactly what output you expect, but here is my attempt:
x <- c('Nice 55 g fish tank with stand',
'45 Gallon Aquarium fish tank and Stand',
'Amazing Hexagonal Fish Tank.',
'92gallon fish tank', 'Fish Tank & Stand $30 obo',
"2007 PROLINE 29' GRAND SPORT CENTER CONSOLE",
'110 gallon tall fish tank',
'20 and 10 Gallon Aquarium / Fish Tanks')
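# Match 2-3 digits, then anything up to a final "g" or "gallon" (case-insensitive)
r <- regmatches(x, gregexpr('\\d{2,3}[^\n]*(?i:g\\b|gallon)', x, perl=TRUE))
# Strings without a match come back as character(0); turn those into NA, then flatten
unlist({r[sapply(r, length)==0] <- NA; r})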
# [1] "55 g" "45 Gallon" NA "92gallon"
# [5] NA NA "110 gallon" "20 and 10 Gallon"
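One thing to watch when the result goes into a new column: unlist() on a gregexpr() result returns more elements than there are inputs as soon as one string contains several matches. A minimal sketch that always returns exactly one value per input is shown below. The third sample string, the tighter pattern, and the "; " separator are made up for illustration.

```r
x <- c("20 and 10 Gallon Aquarium / Fish Tanks",
       "Amazing Hexagonal Fish Tank.",
       "55 g tank, also a 30 gallon tank")   # hypothetical title with two sizes

# Illustrative pattern: 2-3 digits, optional space, then "gallon" or a lone "g"
pat <- "\\d{2,3} ?(?i:gallon|g\\b)"

r <- regmatches(x, gregexpr(pat, x, perl = TRUE))

# Collapse each element to a single string: NA when there is no match,
# all matches joined by "; " otherwise
out <- vapply(r, function(m) {
  if (length(m) == 0) NA_character_ else paste(m, collapse = "; ")
}, character(1))
print(out)
```

The result has the same length as x, so it can be assigned to a column directly.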
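Off-topic: when assigning by reference (using :=), you don't have to reassign the result. That is, data ...

Yes, I haven't given much thought to the wrongly identified values yet, but NA would be better!

Why is every letter optional?

Only because people sometimes write 90g, 90gal, and so on. I was trying to make the code smarter so that it recognizes these variations. But maybe I could have it look only for a number followed by a "G" or "g".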
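The idea in the last comment, a number followed by "g" with the rest of the word optional as a unit rather than letter by letter, could be sketched like this. The pattern and the name tank_size are assumptions for illustration, and the "90gal tank" title is made up.

```r
x <- c("Nice 55 g fish tank with stand",
       "45 Gallon Aquarium fish tank and Stand",
       "Amazing Hexagonal Fish Tank.",
       "92gallon fish tank",
       "Fish Tank & Stand $30 obo",
       "2007 PROLINE 29' GRAND SPORT CENTER CONSOLE",
       "110 gallon tall fish tank",
       "90gal tank")                          # hypothetical extra title

# Hypothetical pattern: 2-3 digits, optional space, then "g", "gal" or
# "gallon" as a whole word. The \\b anchors reject partial hits such as
# "30 o" or the digits inside "2007".
pat <- "\\b[0-9]{2,3} ?g(?:al(?:lon)?)?\\b"

# regexpr() finds the first match per string and returns -1 where none exists
m <- regexpr(pat, x, ignore.case = TRUE, perl = TRUE)

# Start from all NA, then fill in only the positions that matched
tank_size <- rep(NA_character_, length(x))
tank_size[m != -1] <- regmatches(x, m)
print(tank_size)
```

With regexpr() there is no leading .* to eat digits, so "110 gallon" comes back whole, and strings without a match stay NA instead of being returned unchanged.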