Regex: extract a list of sub words from a data frame without a corpus
I want to pick out rows of a data frame based on a list of sub words. I know this can be done with a corpus, but I don't want to do unnecessary work. I first tried match as well as grep, but match only works for exact matches, and grep can't be used with multiple words.
# replace=TRUE is required to draw 10 values from a vector of 4
a=sample(c("clint","offshor","V1fax","12mobile"),10,replace=TRUE)
z=data.frame(a)
z
a
1 V1fax
2 V1fax
3 12mobile
4 12mobile
5 V1fax
6 clint
7 offshor
8 clint
9 clint
10 12mobile
d=z[is.na(match(tolower(z[,1]),c("fax","mobile","except","talwade"))),]
grep(c("fax","mobile","except","talwade"),tolower(z[,1]))
[1] 1 2 5
Warning message:
In grep(c("fax", "mobile", "except", "talwade" :
argument 'pattern' has length > 1 and only the first element will be used
The desired output is:
z
a
1 clint
2 offshor
3 clint
4 clint
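Is there an efficient way to filter by a list of sub words and get the expected result? Thanks.
You can do this with grep. You just need to use the regex OR operator, i.e. |: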
grep( paste( c("fax","mobile","except","talwade") , collapse = "|" ) , tolower(z[,1]) )
# [1] 1 2 3 4 5 10
# The pattern...
paste( c("fax","mobile","except","talwade") , collapse = "|" )
# [1] "fax|mobile|except|talwade"
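The grep call above returns the indices of the rows that do match, while the desired output in the question is the rows that do not. A minimal sketch of that last step, assuming the data shown in the question's printout (typed in by hand here rather than re-sampled) and using the illustrative names words, keep and result, which are not in the original:

```r
# Data copied from the question's printed output (an assumption: not re-sampled)
z <- data.frame(a = c("V1fax", "V1fax", "12mobile", "12mobile", "V1fax",
                      "clint", "offshor", "clint", "clint", "12mobile"),
                stringsAsFactors = FALSE)

words <- c("fax", "mobile", "except", "talwade")
pattern <- paste(words, collapse = "|")   # "fax|mobile|except|talwade"

# grepl() returns one TRUE/FALSE per row, so it can simply be negated
keep <- !grepl(pattern, tolower(z$a))

# drop = FALSE keeps the result a data frame even with a single column
result <- z[keep, , drop = FALSE]
rownames(result) <- NULL                  # renumber rows from 1
result
```

Negating grepl is probably safer than z[-grep(...), ]: when nothing matches, grep returns integer(0), and indexing with -integer(0) drops every row instead of keeping them all. Note also that the words are treated as regular expressions, so a word containing characters such as . or ( would need escaping.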
This will be a bit slower than Simon's solution, but it gives you access to more data for analysis. You can use sapply to return a match matrix:
patterns <- c("fax","mobile","except","talwade")
match.mat <- sapply(patterns, grepl, z$a)
rownames(match.mat) <- z$a
# fax mobile except talwade
# V1fax TRUE FALSE FALSE FALSE
# V1fax TRUE FALSE FALSE FALSE
# 12mobile FALSE TRUE FALSE FALSE
# 12mobile FALSE TRUE FALSE FALSE
# V1fax TRUE FALSE FALSE FALSE
# clint FALSE FALSE FALSE FALSE
# offshor FALSE FALSE FALSE FALSE
# clint FALSE FALSE FALSE FALSE
# clint FALSE FALSE FALSE FALSE
# 12mobile FALSE TRUE FALSE FALSE
Which rows match at least one pattern:
which(rowSums(match.mat) > 0)
# V1fax V1fax 12mobile 12mobile V1fax 12mobile
# 1 2 3 4 5 10
Which patterns match a particular word, and vice versa:
which(match.mat["12mobile", ])
which(match.mat[, "fax"])
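The same match matrix can also produce the filtered data frame the question asks for, plus a count of hits per pattern. This is only a sketch: it rebuilds the data from the question's printout, and no_hit, filtered and hits_per_pattern are names made up for illustration.

```r
# Data copied from the question's printed output (an assumption: not re-sampled)
z <- data.frame(a = c("V1fax", "V1fax", "12mobile", "12mobile", "V1fax",
                      "clint", "offshor", "clint", "clint", "12mobile"),
                stringsAsFactors = FALSE)

patterns <- c("fax", "mobile", "except", "talwade")

# One column per pattern, one row per element of z$a
match.mat <- sapply(patterns, grepl, tolower(z$a))

# Rows where no pattern matched at all
no_hit <- rowSums(match.mat) == 0
filtered <- z[no_hit, , drop = FALSE]
filtered

# How many rows each pattern matched
hits_per_pattern <- colSums(match.mat)
hits_per_pattern
```

Here filtered keeps the original row numbers (6 to 9); rownames(filtered) <- NULL would renumber them if that matters.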