Extracting character-level n-grams from text in R

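I have a data frame with a text column, and I want to extract the character-level bigrams (n = 2) of each text in R, e.g. "st", "ac", "ck".

I would also like to count the frequency of each character-level bigram in the text.

Data: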

df$text

[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"


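I'm not quite sure what your expected output is here. I would have thought the bigrams of "stack" should be "st", "ta", "ac" and "ck", since that captures every consecutive pair.

For example, if you wanted to know how many times the bigram "th" occurs in the word "brothers" and split it into the bigrams "br", "ot", "he" and "rs", you would get the answer 0, which is wrong.

You can build a function to get all the bigrams like this: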
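The difference between the two ways of splitting can be checked directly. The following is a minimal base R sketch, not code from the original answer; the variable names `overlapping` and `chunked` are made up for illustration.

```r
word  <- "brothers"
chars <- strsplit(word, "")[[1]]

# Overlapping (sliding-window) bigrams: every consecutive pair of characters.
overlapping <- paste0(head(chars, -1), tail(chars, -1))

# Non-overlapping bigrams: the word cut into chunks of two characters.
starts  <- seq(1, nchar(word), by = 2)
chunked <- substring(word, starts, starts + 1)

sum(overlapping == "th")  # 1
sum(chunked == "th")      # 0
```

Only the sliding-window version finds the "th" inside "brothers", which is why it is the safer default when counting bigram frequencies.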

library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#>  [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck" 
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example, "s", "t", "a", "c", "k" becomes
# "st", "ta", "ac" and "ck"
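A helper of the kind described, one that takes a vector of single characters and returns every consecutive pair, might look like the following in base R. This is a sketch under assumptions: the name `char_bigrams` is hypothetical, and spaces are assumed to be dropped before pairing, as in the output shown above.

```r
# Hypothetical helper: all consecutive pairs in a vector of single characters.
char_bigrams <- function(chars) {
  if (length(chars) < 2) return(character(0))
  paste0(chars[-length(chars)], chars[-1])
}

# Sample data from the question.
df <- data.frame(
  text = c("hy my name is", "stackover flow is great", "how are you"),
  stringsAsFactors = FALSE
)

# Assumption: spaces are removed first, so bigrams may span word boundaries.
bigrams_per_text <- lapply(strsplit(gsub(" ", "", df$text), ""), char_bigrams)
bigrams_per_text[[3]]
```

Because each text is processed separately, this version does not produce pairs that straddle two texts, such as the "ss" and "th" entries that appear when all characters are concatenated with `unlist()` first.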

In addition to Allan's answer, you can use the qgrams function from the stringdist package together with gsub to remove the spaces:

library(stringdist)
qgrams(gsub(" ", "", df$text), q = 2)

   hy ym yn yo my na st ta ve wi wa ov rf sg ow re ou me is ko lo am ei er fl gr ho ey ck ea at ar ac
V1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1

Is there a way to count their occurrences?
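One way to count occurrences is to pass the character vector of bigrams to `table()`. The sketch below uses base R only and rebuilds the bigrams per text rather than relying on either package; the variable names are illustrative, and spaces are assumed to be removed before pairing.

```r
texts <- c("hy my name is", "stackover flow is great", "how are you")

# Split each text (spaces removed) into characters and pair neighbours.
bigrams <- unlist(lapply(strsplit(gsub(" ", "", texts), ""), function(ch) {
  if (length(ch) < 2) character(0) else paste0(ch[-length(ch)], ch[-1])
}))

# Frequency of each bigram, most common first.
counts <- sort(table(bigrams), decreasing = TRUE)

counts[["is"]]  # 2
counts[["ow"]]  # 2
```

The same `table()` call should also work on any other character vector of bigrams, so it can wrap whichever extraction method you prefer.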