Finding the most frequent words in a text in R

Can someone help me find the most frequent two-word and three-word phrases in a text using R?

My text is:

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")

Here is a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)

#     a    the     of     in phrase 
#    21     18     12     10      8 

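What is returned is an integer vector of frequency counts whose names are the words that were counted.

gsub("[[:punct:]]", "", text) removes punctuation, since I assume you do not want to count punctuation marks.
strsplit(gsub("[[:punct:]]", "", text), " ") splits the string on spaces.
table() counts the frequency of each unique element.
sort(..., decreasing = TRUE) sorts the counts in decreasing order.
head(..., 5) keeps only the 5 most frequent words.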
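To make the steps above easier to follow, here is a minimal sketch that unrolls the one-liner into intermediate variables. The sample sentence and the variable names (sample_text, no_punct, word_vec, counts, top) are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample sentence, not the text from the question
sample_text <- "The cat sat. The cat ran, and the dog sat."

# Step 1: remove punctuation
no_punct <- gsub("[[:punct:]]", "", sample_text)

# Step 2: split on single spaces; strsplit() returns a list, so take element 1
word_vec <- strsplit(no_punct, " ")[[1]]

# Step 3: count how often each unique word occurs
counts <- table(word_vec)

# Step 4: sort in decreasing order and keep the top 3
top <- head(sort(counts, decreasing = TRUE), 3)
print(top)
```

Note that "The" and "the" end up as two different entries here, because nothing in these steps changes the case of the words.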
We can split the text into words and summarise the frequencies with table():
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = TRUE)

The tidytext package makes this kind of thing quite simple:
library(tidytext)
library(dplyr)

tibble(text = text) %>%    # tibble() replaces the deprecated data_frame()
    unnest_tokens(word, text) %>%    # split words
    anti_join(stop_words, by = "word") %>%    # take out "a", "an", "the", etc.
    count(word, sort = TRUE)    # count occurrences

# Source: local data frame [73 x 2]
# 
#           word     n
#          (chr) (int)
# 1       phrase     8
# 2     sentence     6
# 3        words     4
# 4       called     3
# 5       common     3
# 6  grammatical     3
# 7      meaning     3
# 8         alex     2
# 9         bird     2
# 10    complete     2
# ..         ...   ...


If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)

tibble(value = tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE)) %>%    # tokenize bigrams and trigrams
    count(value, sort = TRUE)    # count

# Source: local data frame [531 x 2]
# 
#           value     n
#          (fctr) (int)
# 1        of the     5
# 2      a phrase     4
# 3  the sentence     4
# 4          as a     3
# 5        in the     3
# 6        may be     3
# 7    a complete     2
# 8   a phrase is     2
# 9    a sentence     2
# 10      a white     2
# ..          ...   ...

With text defined as in the question, we can find the most common two-word and three-word phrases:
library(ngram)

# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)

# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)
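
As a package-free cross-check of the phrase counts, here is a rough base R sketch that counts n-grams directly. The helper count_ngrams() and the sample sentence are hypothetical names introduced here, not part of the ngram package. The sketch assumes that lowercasing and stripping punctuation is acceptable, and it lets n-grams run across sentence boundaries.

```r
# Hypothetical helper: count runs of n consecutive words in a single string
count_ngrams <- function(x, n) {
  # lowercase, drop punctuation, split on runs of whitespace
  w <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")[[1]]
  w <- w[w != ""]
  # too few words to form even one n-gram: return an empty table
  if (length(w) < n) return(table(character(0)))
  # one n-gram starts at every position from 1 to length(w) - n + 1
  starts <- seq_len(length(w) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(w[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

# Hypothetical sample sentence, not the text from the question
sample_text <- "The cat sat on the mat and the cat sat down"
bi  <- count_ngrams(sample_text, 2)
tri <- count_ngrams(sample_text, 3)
head(bi, 3)
head(tri, 3)
```

Calling count_ngrams(text, 2) and count_ngrams(text, 3) on the question's text should give comparable two-word and three-word rankings, although the exact counts depend on how punctuation and case are handled.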

We can also use a Markov chain to generate new sequences:
# if we are using ng2 (bi-gram)
lnth = 2 
babble(ng = ng2, genlen = lnth)

# if we are using ng3 (tri-gram)
lnth = 3  
babble(ng = ng3, genlen = lnth)
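
To illustrate the idea behind a bigram Markov chain, here is a small base R sketch. It is not how babble() is implemented; build_transitions() and predict_next() are hypothetical helpers, and the sketch picks the most frequent follower deterministically instead of sampling at random.

```r
# Hypothetical helper: table of how often each word is followed by each other word
build_transitions <- function(x) {
  w <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")[[1]]
  w <- w[w != ""]
  # rows: current word, columns: following word, cells: number of occurrences
  table(current = head(w, -1), following = tail(w, -1))
}

# Hypothetical helper: most frequent follower of `word`,
# or NA if the word was never seen with a successor
predict_next <- function(trans, word) {
  if (!word %in% rownames(trans)) return(NA_character_)
  colnames(trans)[which.max(trans[word, ])]
}

# Hypothetical sample sentence, not the text from the question
sample_text <- "the cat sat on the mat and the cat ran"
trans <- build_transitions(sample_text)

predict_next(trans, "the")  # "cat" follows "the" twice, "mat" only once
predict_next(trans, "ran")  # last word of the sample, so no successor is known
```

Replacing which.max() with sample() weighted by the row counts would give random generation that is closer in spirit to what babble() does.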

The simplest approach:
require(quanteda)

# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
##      of_the     a_phrase the_sentence       may_be         as_a       in_the    in_common    phrase_is 
##           5            4            4            3            3            3            2            2 
##  is_usually     group_of 
##           2            2 

# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
##     a_phrase_is   group_of_words    of_a_sentence  of_the_sentence   for_example_in   example_in_the 
##               2                2                2                2                2                2 
## in_the_sentence   an_orange_bird orange_bird_with      bird_with_a 
##               2                2                2                2 

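I'm not sure whether this is the frequency the user wants. Check @Manoj Kumar's answer.

@RonakShah You may be right, but in that case the title of the question is misleading. This answer provides a correct solution according to the title. Since it can be assumed that only a small fraction of programming experts are also NLP experts, I think the OP should state the expected output more clearly.

Thanks @RonakShah and RHertel, but my question was certainly not meant to mislead. You both answered what I needed. Thank you all.

While I like this answer as a solution to "find the most frequent words", I believe more transformations than just removing punctuation could be helpful. In particular, I think converting all entries to lowercase might be a good idea. I was considering providing an alternative using the tm package, but the question seems to have been answered to the OP's satisfaction already.

I think you should also strip out words like "a, the, is" (as well as removing punctuation). Although it is not required for this question, it would certainly help other NLP learners and practitioners. Thanks.

Nice one @alistaire, a short method for counting frequencies of occurrence.

The first approach (tidytext) has one %>% too many at the end. I also got the error unused argument (sort = TRUE) in count(., word, sort = TRUE). This is a conflict with plyr, whose count() masks the dplyr version and has no sort argument; calling dplyr::count(word, sort = TRUE) resolves it. Otherwise this is the best option.

Hi Ken. Very nice: simple, easy and only a few lines. But I have one doubt: can it be used to predict the next word (like the predictive keyboard on Android phones)? And why is there an underscore between the two words?

It could be used to predict the next word if you use the ngrams in a predictive model. The "_" is the default value of the concatenator argument of ngrams(), which can be passed in dfm(). See ?quanteda::tokenise or ?quanteda::ngrams.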
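
Following up on the suggestions above about lowercasing and dropping words like "a, the, is", here is a minimal base R sketch. The stop-word list is a deliberately tiny, made-up example (real lists such as the one in tidytext are much longer), and the sample sentence is hypothetical.

```r
# Hypothetical, deliberately tiny stop-word list; real lists are much longer
stop_list <- c("a", "an", "the", "is", "of", "in", "and", "or", "to")

# Hypothetical sample text, not the text from the question
sample_text <- "The phrase is a group of words. A phrase is the unit of a sentence."

# lowercase first so that "The" and "the" are treated as the same word
w <- strsplit(gsub("[[:punct:]]", "", tolower(sample_text)), "[[:space:]]+")[[1]]

# drop empty strings and anything in the stop-word list
w <- w[w != "" & !(w %in% stop_list)]

freq <- sort(table(w), decreasing = TRUE)
head(freq, 5)
```

The same two extra steps, tolower() and the %in% filter, can be slotted into the base R one-liner from the first answer.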