R: remove all words from a character vector that are not specific words
I have a character vector like this:
[70] "CSF 5896-6133"
[71] "CRT 16"
[72] "SEEF 54-55"
[73] "CIF 190-195"
[74] "DE & /ON CIF 196-222"
[75] " CRT 17 "
[76] " SEEF 56-57"
[77] "DE & /ON CSF 6134-6725 "
[78] " SEEF 58-60"
[79] "CRT 18"
[80] " CSF 6726-6837"
[81] "SEEF 61"
[82] " CSF 6840-6926"
[83] " CIF 223-226"
[84] "SEEF 62-63"
[85] " CSF 6927-7065"
[86] " CIF 226-228"
[87] "CSF 7066-7185"
[88] "CSF 7186-7311"
[89] " CIF 229"
[90] " SEEF 66"
[91] "CSF 7312-7561"
[92] " CRT 19"
[93] " SEEF 67-68"
[94] "Final data QAQC done on CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
[98] "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
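As you can see, this is only part of it.
I want to remove all words that are not numbers or one of these keywords:
CSF, CIF, SEEF, CRT
For example, entries 94-98 should become: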
[94] "CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
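As you can see, entry 98 is removed entirely because it contains none of the keywords I want. Some words are also removed from entry 94.
Consider the following vector: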
v <- c("Final data QAQC done on CSF 1-7561",
"CIF 1-229",
"SEEF 1-68",
"CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
Extracting the keywords and number ranges from each element gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#[1] NA
As @akrun mentioned, you can also use:
regmatches(v, gregexpr(pattern, v))
which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#character(0)
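The definition of `pattern` is not shown above. Here is a minimal base R sketch, assuming a pattern that keeps whole whitespace-delimited tokens that are either one of the four keywords or a number (optionally a range). The lookaround pattern and the `perl = TRUE` flag are my assumptions, not necessarily what the answer used:

```r
v <- c("Final data QAQC done on CSF 1-7561",
       "CIF 1-229",
       "SEEF 1-68",
       "CRT 1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

# Assumed (hypothetical) pattern: a whole token that is a keyword or a number / number range.
# (?<!\S) and (?!\S) require whitespace or a string boundary on each side,
# so fragments of "082015-HOBA-G17-1" are not picked up.
pattern <- "(?<!\\S)(?:CSF|CIF|SEEF|CRT|[0-9]+(?:-[0-9]+)?)(?!\\S)"

# One character vector of matches per input element; character(0) when nothing matches
res <- regmatches(v, gregexpr(pattern, v, perl = TRUE))

# Collapse each element back into one string and drop elements with no match
kept <- vapply(res[lengths(res) > 0], paste, character(1), collapse = " ")
print(kept)
```

Filtering on `lengths(res) > 0` is what drops the last element entirely, which is the behaviour asked for in the question.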
Using stringr:
I would use the stringr library.
Here is a subset of your data:
x <- c("CSF 5896-6133",
"CRT 16",
"SEEF 54-55",
"CIF 190-195",
"Final data QAQC done on CSF 1-7561",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
)
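If nothing matches the pattern, it returns a missing value (NA).
Please see the answer I posted 5 minutes ago; I admit it is similar. Try a slightly different regex.
Isn't this the same as @Psidom's?
Yes, very similar! I just posted a little before he replied.
Actually he posted at 16:54:18 and you posted at 16:54:44; anyway, it is also a slightly different regex, so the OP can try all the solutions. Cheers.
A base R option is regmatches(v, gregexpr(pattern, v)). Plus one.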
library(stringr)
testString <- c("Final data QAQC done on CSF 1-7561" ,
" CIF 1-229" ,
" SEEF 1-68 ",
" CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area" )
str_extract(testString, "(CSF|CIF|SEEF|CRT)\\s+\\d+-\\d+")
[1] "CSF 1-7561" "CIF 1-229" "SEEF 1-68" "CRT 1-19" NA
library(stringr)
str_extract(x, '(CSF|CIF|SEEF|CRT)[:space:]+([0-9]|-)+')
[1] "CSF 5896-6133" "CRT 16" "SEEF 54-55" "CIF 190-195" "CSF 1-7561"
[6] NA
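To get the exact result asked for in the question (a shorter character vector with the non-matching elements dropped rather than returned as NA), a base R sketch without stringr could combine `regexpr` with `regmatches`, which returns only the matched substrings. The pattern below assumes that every wanted entry is a keyword followed by a number or a number range; the names `pat` and `cleaned` are only illustrative:

```r
x <- c("CSF 5896-6133",
       " CRT 17 ",
       "DE & /ON CIF 196-222",
       "Final data QAQC done on CSF 1-7561",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

# Keyword, whitespace, then a number with an optional "-number" range
pat <- "(CSF|CIF|SEEF|CRT)[[:space:]]+[0-9]+(-[0-9]+)?"

# regexpr() finds the first match in each element; regmatches() keeps only the
# matched text, so elements without a keyword disappear from the result
cleaned <- regmatches(x, regexpr(pat, x))
print(cleaned)
```

Because only the matched text is kept, surrounding whitespace and prefixes such as "DE & /ON" are removed as well. Note that this keeps only the first keyword/number pair per element.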