R term frequency from a large document set

I have a data frame like this:

ID       content
 1       hello you how are you
 1       you are ok
 2       test
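I need the frequency of each word in content, by ID; the words are space-separated. Essentially this means finding the unique terms in the column, counting how often each occurs within each ID, and displaying the result like this: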

ID      hello    you   how   are  ok    test
 1        1       3     1    2     1     0
 2        0       0     0    0     0     1    
I tried:

test<- unique(unlist(strsplit(temp$val, split=" ")))

df<- cbind(temp, sapply(test, function(y) apply(temp, 1, function(x) as.integer(y %in% unlist(strsplit(x, split=" "))))))

You can use data.table:

library(data.table)
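# Split each content string into words within each ID (long format),
# then cast to wide format, counting occurrences of each word per ID
setDT(df1)[, unlist(strsplit(content, split = " ")), by = ID
           ][, dcast(.SD, ID ~ V1, fun.aggregate = length, value.var = "V1")]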
#   ID are hello how ok test you
#1:  1   2     1   1  1    0   3
#2:  2   0     0   0  0    1   0

In the first part, we apply unlist(strsplit(content, split = " ")) within each ID group, which gives the following output:

#   ID    V1
#1:  1 hello
#2:  1   you
#3:  1   how
#4:  1   are
#5:  1   you
#6:  1   you
#7:  1   are
#8:  1    ok
#9:  2  test
In the next step, we use dcast to reshape the data into wide format, counting the occurrences of each word per ID.

Data

df1 <- structure(list(ID = c(1L, 1L, 2L), content = c("hello you how are you", 
"you are ok", "test")), .Names = c("ID", "content"), class = "data.frame", row.names = c(NA, 
-3L))
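
For comparison, the same two steps (split into one word per row, then count per ID and word) can be sketched in base R without any package. This is a minimal illustration and not part of the answer above; the names words, long, and freq are chosen here for the example, and df1 is rebuilt so the snippet runs on its own.

```r
# Same sample data as above, rebuilt so the snippet is self-contained
df1 <- data.frame(ID = c(1L, 1L, 2L),
                  content = c("hello you how are you", "you are ok", "test"),
                  stringsAsFactors = FALSE)

# Step 1: split each content string on single spaces
words <- strsplit(df1$content, split = " ", fixed = TRUE)

# Repeat each ID once per word to get a long table: one row per (ID, word)
long <- data.frame(ID = rep(df1$ID, times = lengths(words)),
                   term = unlist(words),
                   stringsAsFactors = FALSE)

# Step 2: cross-tabulate ID against term; absent combinations count as 0
freq <- table(ID = long$ID, term = long$term)
print(freq)
```

The result is a contingency table with one row per ID and one column per word, with columns in alphabetical order. If a plain data frame is preferred, as.data.frame.matrix(freq) converts it.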

How about a package made for text mining?

# your data
text <- read.table(text = "
ID      content
1       'hello you how are you'
1       'you are ok'
2       'test'", header = TRUE, stringsAsFactors = FALSE) # remember the stringsAsFactors life saver!