R: Count the frequency of each unique word in a column
I have a data frame df that contains a column named strings. The values in this column are sentences, for example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
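Now, I have a list of stop words:
["I", "don't", "you"]
How can I build another data frame that stores the total number of occurrences of each unique word (excluding stop words) in that column of the previous data frame?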
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
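My idea is:
But this seems very inefficient, and I don't know how to actually write the code.

Assuming you have a mystring object and a stopWords vector, you can do this: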
# split the text into a vector of words
wordvector = strsplit(mystring, " ")[[1]]
# remove stop words from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point, you can tabulate the frequencies with table() and convert the result into a data frame object:
frequency_df = data.frame(table(wordvector))
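The snippet above works on a single string. As a rough sketch of how the same idea could be applied to the whole strings column from the question, the base R code below flattens every row into one word vector before tabulating. The punctuation stripping with gsub and the keyword/frequency column names are assumptions added here to match the expected output in the question; they are not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stopWords <- c("I", "don't", "you")

# Assumption: strip punctuation except apostrophes,
# so that "you?" and "you." both become "you"
cleaned <- gsub("[^[:alnum:]' ]", "", df$strings)

# Split every row on whitespace and flatten into one character vector
wordvector <- unlist(strsplit(cleaned, "\\s+"))

# Drop the stop words, then tabulate the remaining words
wordvector <- wordvector[!wordvector %in% stopWords]
frequency_df <- as.data.frame(table(wordvector), stringsAsFactors = FALSE)
names(frequency_df) <- c("keyword", "frequency")
frequency_df
```

Note that table() sorts the words alphabetically, so the rows will not be in order of first appearance as in the question's example.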
Let me know if this helps.

One approach is to use tidytext. Here is some code:
library("tidytext")
library("tidyverse")
df <- data.frame(id = 1:6, strings = c("I want to go to school", "how about you?",
  "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
  filter(!word %in% c("I", "i", "don't", "you")) %>%
  count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
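Edit
All tokens are converted to lowercase, so you can either include i in the stop words or add the argument to_lower = FALSE to unnest_tokens.

First, you can create a vector of all the words with str_split, then build a frequency table of the words: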
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
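One thing to watch with splitting on a single space: punctuation stays attached to the words, so "you?" and "you." do not match the stop word "you". The base R sketch below illustrates this and shows one possible workaround, extracting runs of letters and apostrophes and comparing in lower case. The regular expression and the lower-casing step are assumptions made for this illustration, not part of the answer above.

```r
strings <- c("I want to go to school, how about you?",
             "I like you.",
             "I don't like you")
stop_words <- c("I", "don't", "you")

# Splitting on a single space keeps punctuation attached to the word
naive <- unlist(strsplit(strings, " "))
naive[!naive %in% stop_words]
# "you?" and "you." are still present in this result

# Assumption: treat any run of letters or apostrophes as a word,
# and compare against the stop words in lower case
tokens <- tolower(unlist(regmatches(strings, gregexpr("[[:alpha:]']+", strings))))
kept <- tokens[!tokens %in% tolower(stop_words)]

# Frequency table, most frequent words first
word_list <- as.data.frame(table(all_words = kept), stringsAsFactors = FALSE)
word_list[order(-word_list$Freq), ]
```

Lower-casing everything mirrors what unnest_tokens does by default; if case matters for your data, drop the tolower() calls.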
This is really clear. But how can I get all the unique words in my column into a vector? I got stuck right at the start.
I edited the post directly; I hope this helps.