Creating a bag of words from a string in R

r text

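I have found many bag-of-words implementations, but I still cannot find a simple one that works on a single long string. The result I want is: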

word1:     56
word2:     31
wordX:     7

I have a problem with the qdap library, because it does not work on my R installation.

Because of capitalization and punctuation, something like strsplit on its own probably won't give you quite what you need. The tokenizers package is what tidytext uses:

library(tokenizers)

text <- "this is some random TEXT is string 45 things and numbers and text!"

table(tokenize_words(text))

     45     and      is numbers  random    some  string    text  things    this 
      1       2       2       1       1       1       1       2       1       1
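As a side note on the call above, tokenize_words() returns a list with one element per input string, so it can be convenient to pull out the character vector before counting. The sketch below does that and also shows the strip_numeric argument for dropping tokens such as "45"; the variable names are illustrative only.

```r
library(tokenizers)

text <- "this is some random TEXT is string 45 things and numbers and text!"

# tokenize_words() lowercases and strips punctuation by default.
# It returns a list, so take the first element for a single string.
tokens <- tokenize_words(text)[[1]]

# Count each word and put the most frequent ones first.
counts <- sort(table(tokens), decreasing = TRUE)

# Optionally drop purely numeric tokens such as "45".
no_numbers <- tokenize_words(text, strip_numeric = TRUE)[[1]]
```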

If you go this route, you may want to move to tidytext entirely:

library(dplyr)
library(tidytext)
library(tibble)

df <- tibble(string = text)

df %>%
  unnest_tokens(word, string) %>%
  count(word)

# A tibble: 10 x 2
   word        n
   <chr>   <int>
 1 45          1
 2 and         2
 3 is          2
 4 numbers     1
 5 random      1
 6 some        1
 7 string      1
 8 text        2
 9 things      1
10 this        1
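To get closer to the "word: count" layout asked for in the question, the same pipeline can be sorted by frequency and printed line by line. This is only a sketch: it assumes the dplyr, tidytext and tibble packages are installed, and the name counts is made up for the example.

```r
library(dplyr)
library(tidytext)
library(tibble)

text <- "this is some random TEXT is string 45 things and numbers and text!"

# Tokenize into one word per row, then count with the most frequent first.
counts <- tibble(string = text) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE)

# Print one "word: n" line per row.
cat(sprintf("%s: %d", counts$word, counts$n), sep = "\n")
```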


It is easier to help you if you include a simple sample input and the desired output, which can be used to test and verify possible solutions. Just use table on the split words.
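The suggestion to use table on the split words can be sketched in base R as follows. The naive version shows why case and punctuation matter; the normalized version lowercases the text and splits on runs of non-alphanumeric characters, which is one possible choice of delimiter rather than the only one. The variable names are illustrative.

```r
text <- "this is some random TEXT is string 45 things and numbers and text!"

# Naive split on single spaces: case and punctuation are kept,
# so "TEXT" and "text!" end up as two separate entries.
naive <- table(strsplit(text, " ", fixed = TRUE)[[1]])

# Normalized split: lowercase first, then split on runs of
# characters that are neither letters nor digits.
words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
words <- words[nzchar(words)]  # drop empty strings, e.g. from a leading separator

# Count each word and put the most frequent ones first.
bag <- sort(table(words), decreasing = TRUE)
print(bag)
```

For anything beyond simple English text, a dedicated tokenizer is likely to handle edge cases such as apostrophes and hyphens better than a hand-written regular expression.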