R: Counting the frequency of each unique word in a column


I have a data frame df that contains a column named strings. The values in this column are sentences.

For example:

id    strings
1     "I want to go to school, how about you?"
2     "I like you."
3     "I like you so much"
4     "I like you very much"
5     "I don't like you"

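Now I have a list of stop words:

["I", "don't", "you"]

How can I build another data frame that stores the total number of occurrences of each unique word (stop words excluded) in that column of the previous data frame?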

keyword      frequency
  want            1
  to              2
  go              1
  school          1
  how             1
  about           1
  like            4
  so              1
  very            1
  much            2

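My idea is:

Combine the strings in the column into one big string

Build a list of the unique words in the big string

Create a df with one column holding the unique words

Compute the frequencies

But this seems very inefficient, and I don't know how to actually write the code.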

Assuming you have a mystring object and a stopWords vector, you can do this:

# split the text into a vector of words
words = strsplit(mystring, " ")[[1]]

# remove the stop words from the vector
words = words[!words %in% stopWords]

At this point you can convert the frequency table() into a data.frame object:

frequency_df = data.frame(table(words))

Let me know if this helps.
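The answer above starts from a single mystring, so here is a minimal base R sketch of the whole path from the df column to the frequency data frame. It assumes the sample data from the question. Stripping punctuation (while keeping apostrophes so that "don't" still matches the stop word) and the output column names keyword and frequency are choices made for this illustration, not part of the original answer.

```r
# Sample data from the question
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stopWords <- c("I", "don't", "you")

# Combine the column into one big string
mystring <- paste(df$strings, collapse = " ")

# Assumption: drop punctuation except apostrophes, so "you?" becomes "you"
cleaned <- gsub("[^[:alnum:][:space:]']", "", mystring)

# Split on whitespace and remove the stop words
words <- strsplit(cleaned, "[[:space:]]+")[[1]]
words <- words[!words %in% stopWords]

# Count each unique word and convert the table to a data frame
frequency_df <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(frequency_df) <- c("keyword", "frequency")
print(frequency_df)
```

Note that table() orders the keywords alphabetically, so the rows come out in a different order than in the question's expected output.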

One way is to use tidytext. Here is some code:

library("tidytext")
library("tidyverse")

#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))

df %>% 
  mutate(strings = as.character(strings)) %>% 
  unnest_tokens(word, strings) %>%   # tokenizes the strings and extracts the words
  filter(!word %in% c("I", "i", "don't", "you")) %>% 
  count(word)

#> # A tibble: 11 x 2
#>    word       n
#>    <chr>  <int>
#>  1 about      1
#>  2 go         1
#>  3 how        1
#>  4 like       4
#>  5 much       2


Edit

All tokens are converted to lowercase, so you can either include "i" in the stop words, or add the argument to_lower = FALSE to unnest_tokens.
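As a sketch of the second option, the pipeline below keeps the original case, so a capitalized stop word such as "I" matches directly. In tidytext the argument is named to_lower. The sample data comes from the question, and the vector name custom_stop_words is only illustrative (chosen to avoid clashing with the stop_words dataset that tidytext ships).

```r
library(dplyr)
library(tidytext)

# Sample data from the question
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)

# Hypothetical name for the question's stop-word list
custom_stop_words <- c("I", "don't", "you")

word_counts <- df %>%
  unnest_tokens(word, strings, to_lower = FALSE) %>%  # keep original case
  filter(!word %in% custom_stop_words) %>%
  count(word)

print(word_counts)
```

With the default to_lower = TRUE, the same filter would let "i" through, which is why the answer's code lists both "I" and "i".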

First, you can create a vector of all the words with str_split, and then build a frequency table of the words:

library(stringr)
stop_words <- c("I", "don't", "you")

# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))

# create a frequency table 
word_list <- as.data.frame(table(all_words))

# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
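One caveat with splitting on a single space is that punctuation stays attached, so tokens such as "you?" and "you." would not match the stop word "you". A possible variant, sketched here under the assumption that words consist only of letters and apostrophes, extracts the words with str_extract_all instead of splitting. The sample data is taken from the question.

```r
library(stringr)

# Sample data from the question
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# Assumption: a word is a run of letters and apostrophes
all_words <- unlist(str_extract_all(df$strings, "[[:alpha:]']+"))

# Frequency table, then drop the stop words
word_list <- as.data.frame(table(all_words), stringsAsFactors = FALSE)
word_list <- word_list[!word_list$all_words %in% stop_words, ]
print(word_list)
```

Here all_words is the vector of every word in the column, and unique(all_words) would give the distinct words if only those are needed.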

This is really clear. But how can I get all the unique words in my column into a vector? I am stuck right at the start.

I have edited the post directly; I hope this helps.