String search with %in% (including special characters) fails while str_detect works
I am doing sentiment analysis and want to get all bigrams that start with a negation word such as "didn’t". %in% works fine for simple strings, but on my text it does not work for strings that contain special characters such as apostrophes.
Bigrams in my text:
> head(sup4_bigrams_count,3)
# A tibble: 3 x 3
word1 word2 n
<chr> <chr> <int>
1 parent’s day 8
2 mother’s day 7
3 bachelor’s degree 6
> sup4_bigrams_count$word1 %>% unique
......
[61] "daily" "day" "de" "define"
[65] "depth" "developed" "didn’t" "differentiated"
[69] "difunctioning" "diploma" "doesn’t" "don’t"
But using %in% does not work at all:
negate_words <- c("didn’t","doesn’t","don’t")
> sup4_bigrams_count %>% filter(word1 %in% negate_words)
# A tibble: 0 x 3
# ... with 3 variables: word1 <chr>, word2 <chr>, n <int>
But if I build a separate data frame from these same words, %in% works:
a <- data_frame(word=c("didn’t","doesn’t","don’t"),ind=1:3)
n <- c("didn’t","doesn’t")
> a %>% filter(word %in% n)
# A tibble: 2 x 2
word ind
<chr> <int>
1 didn’t 1
2 doesn’t 2
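All I can do is filter three times with str_detect and then rbind the results together, but that becomes much more tedious if I have a long list of negation words. I hope someone can help.
You can construct an "OR" regular expression that searches for all the negation words at once: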
library(stringr)
negate_words <- c("didn’t","doesn’t","don’t")
strings <- c("daily", "day", "de", "define",
"depth", "developed", "didn’t", "differentiated",
"difunctioning", "diploma", "doesn’t", "don’t")
str_detect(strings, "didn’t")
# FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
pattern <- paste0("(", paste(negate_words, collapse="|"), ")")
pattern
# "(didn’t|doesn’t|don’t)"
str_detect(strings, pattern)
# FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
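To use this inside the original filter() call, the same pattern can be passed to str_detect on the word1 column. The following is a minimal sketch, not the asker's actual data: `bigrams` is a hypothetical stand-in for sup4_bigrams_count, and the `^` and `$` anchors are an addition that restricts the match to whole words (as %in% would), since the unanchored pattern also matches substrings. It assumes the negation words contain no regex metacharacters.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for sup4_bigrams_count (values invented for illustration)
bigrams <- tibble(
  word1 = c("parent’s", "didn’t", "don’t", "day"),
  word2 = c("day", "know", "care", "care"),
  n     = c(8L, 3L, 2L, 1L)
)

negate_words <- c("didn’t", "doesn’t", "don’t")

# Anchored alternation: word1 must equal one of the negation words exactly
pattern <- paste0("^(", paste(negate_words, collapse = "|"), ")$")

# Keep only the bigrams whose first word is a negation word
negated <- bigrams %>% filter(str_detect(word1, pattern))
negated
```

This replaces the three separate str_detect filters and the rbind with a single filter, and it scales to a longer negate_words vector without further changes.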
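The thread does not establish why %in% fails on the original column. One way to investigate is to compare the code points and declared encodings of a column value against the literal being searched for. The sketch below uses two invented strings that differ only in the apostrophe character, purely to illustrate the diagnostic. Whether this, or a difference in encoding marks, is what happens in the asker's data is an assumption that would need checking there.

```r
# Two strings that print almost alike but use different apostrophes (invented example)
typed <- "didn't"        # ASCII apostrophe, U+0027
curly <- "didn\u2019t"   # right single quotation mark, U+2019

# Exact matching treats them as different strings
is_match <- typed %in% curly

# Inspect the code points to see where the strings differ
utf8ToInt(typed)
utf8ToInt(curly)

# Inspect the declared encoding of the non-ASCII string
Encoding(curly)

# One possible normalization: convert to UTF-8 and map the curly apostrophe to ASCII
normalized <- gsub("\u2019", "'", enc2utf8(curly), fixed = TRUE)
is_match_after <- normalized %in% typed
```

Running the same utf8ToInt and Encoding checks on an element of sup4_bigrams_count$word1 and on the corresponding element of negate_words would show whether the two sides really hold identical strings.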