Regex: expression captures everything when the capture group doesn't exist
Tags: regex, r, data.table


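I'm trying to use regex capture groups to extract some specific text from a column. One thing I've noticed is that if the capture group doesn't exist, it grabs everything.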

I create the new column with the parsed text using data.table syntax in R. The results are as follows:

55 g                            <- fine  
45 Gallon                       <- fine  
Amazing Hexagonal Fish Tank.    <- not good.  how to replace with NA?  
92gallon                        <- fine  
30 o                            <- wrongly identified  
29                              <- wrongly identified 
10 gallon                       <- I thought [0-9]{2,3} would grab 2 or 3 digits?  
10 gallon                       <- only 1 of 2 tank sizes identified

I'm not sure exactly what output you expect, but here is my attempt:

# Sample listing titles
x <- c('Nice 55 g fish tank with stand',
       '45 Gallon Aquarium fish tank and Stand',
       'Amazing Hexagonal Fish Tank.',
       '92gallon fish tank', 'Fish Tank & Stand $30 obo',
       "2007 PROLINE 29' GRAND SPORT CENTER CONSOLE",
       '110 gallon tall fish tank',
       '20 and 10 Gallon Aquarium / Fish Tanks')

# Match 2-3 digits, then anything up to a standalone "g" or "gallon" (case-insensitive)
r <- regmatches(x, gregexpr('\\d{2,3}[^\n]*(?i:g\\b|gallon)', x, perl=TRUE))
# Titles without a match come back as character(0); replace those with NA
r[lengths(r) == 0] <- NA
unlist(r)

# [1] "55 g"             "45 Gallon"        NA                 "92gallon"        
# [5] NA                 NA                 "110 gallon"       "20 and 10 Gallon"
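The asker's code is not shown here, so the following is only a guess at the cause. It assumes the column was built with sub() and a backreference such as '\\1'. In that case sub() returns the input unchanged whenever the pattern fails to match, which would explain why a title with no tank size comes back whole. The pattern and the names titles, pat, raw, hit and size are illustrative, not taken from the question.

```r
# Two of the titles from the answer above
titles <- c('45 Gallon Aquarium fish tank and Stand',
            'Amazing Hexagonal Fish Tank.')

# Illustrative pattern (an assumption, not the asker's): digits plus unit in group 1
pat <- '.*?(\\d{2,3}\\s*(?:gallon|g\\b)).*'

# sub() leaves a string untouched when the pattern does not match,
# so the second title is returned in full
raw <- sub(pat, '\\1', titles, ignore.case = TRUE, perl = TRUE)

# Guard with grepl() so that unmatched titles become NA instead
hit  <- grepl(pat, titles, ignore.case = TRUE, perl = TRUE)
size <- ifelse(hit, raw, NA_character_)
print(size)
```

The grepl() guard keeps the sub() approach and only adds the NA handling. The regmatches() version above avoids the problem by returning matches rather than substitutions.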

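Off-topic: when assigning by reference (using :=), you don't have to reassign the result. That is, data ...

Yes, I haven't thought much about the misidentified data yet.. but NA would be better!

Why is every letter optional?

Just because people sometimes write 90g, 90gal, 90 gal, 90 g, and so on. I was trying to make the code smarter so that it recognizes these variations. But maybe I could make it look only for a number followed by a "Gg" or "Gg".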
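One way to cover the spelling variations mentioned above (90g, 90gal, 90 gal, 90 g) without making every letter optional is to list the accepted unit spellings explicitly. This is a minimal base R sketch. The helper name extract_gallons and the exact pattern are assumptions for illustration, not part of the original thread.

```r
# Hypothetical helper: first tank size (in gallons) found in each title, else NA
extract_gallons <- function(titles) {
  # Whole number of 2-3 digits, optional space, then g / gal / gallon(s) as a word
  pat <- '\\b\\d{2,3}\\s*(?:gallons?|gal|g)\\b'
  m   <- regexpr(pat, titles, ignore.case = TRUE, perl = TRUE)
  hit <- as.integer(m) != -1L            # TRUE where a size was found
  out <- rep(NA_character_, length(titles))
  out[hit] <- regmatches(titles, m)      # regmatches() returns matched elements only
  as.integer(sub('\\D+$', '', out))      # drop the unit, keep the digits
}

titles <- c('Nice 55 g fish tank with stand',
            '45 Gallon Aquarium fish tank and Stand',
            'Amazing Hexagonal Fish Tank.',
            '92gallon fish tank', 'Fish Tank & Stand $30 obo',
            "2007 PROLINE 29' GRAND SPORT CENTER CONSOLE",
            '110 gallon tall fish tank',
            '20 and 10 Gallon Aquarium / Fish Tanks')

print(extract_gallons(titles))
# With a data.table `dt` holding a `title` column (both hypothetical), the result
# could be assigned by reference: dt[, gallons := extract_gallons(title)]
```

The leading \\b stops the pattern from starting in the middle of a number, so 110 is not cut down to 10. The sketch reports only the first size with a unit directly after it, so the last title yields 10 rather than both 20 and 10.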