Putting a list of term:frequency pairs into a matrix in R


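I have a large data set in the following format: each line holds one document, encoded as word:frequency-in-document pairs separated by spaces; lines can vary in length: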

aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...

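For example, in the first document, "aword" occurs 3 times. What I ultimately want to do is build a small search engine in which documents (in the same format) matching a query are ranked; I was thinking of using TfIdf and the tm package (based on this tutorial, which requires the data to be in the form of a TermDocumentMatrix). Otherwise I would simply use tm's TermDocumentMatrix function on a text corpus, but the catch here is that I already have the data indexed in this format (and I would rather use this data, unless the format is truly exotic and cannot be converted).

What I have tried so far is to import the lines and split them: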

docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")

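From there I would like to turn this into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I do not know what these things are called (I have been googling for a day on themes such as "term document vector/array/pairs", "two-dimensional array", "list to matrix" and so on).

What would be a good way to get such a list of documents into a term-document frequency matrix? Or, if the solution is too obvious or doable with built-in functions: what is the actual term for the format I described above, with term:frequency pairs on one line and each line being a document?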

Here is one approach that should get you the output you are after:

## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x) 
  cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
                                dimnames = list(NULL, c("word", "count"))))))

## Convert to a data.frame
out <- data.frame(out)
out
#    document  word count
# 1 document1 aword     3
# 2 document1 bword     2
# 3 document1 cword    15
# 4 document1 dword     2
# 5 document2 bword     4
# 6 document2 cword    20
# 7 document2 fword     1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))

## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
#        document
# word    document1 document2
#   aword         3         0
#   bword         2         4
#   cword        15        20
#   dword         2         0
#   fword         0         1
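
Since the question mentions a large data set, a sparse variant of the same idea may be worth considering. The following is only a sketch, assuming the Matrix package is available and that the text after the last colon in each token is the count; the variable names (tokens, doc_id, terms, tdm) are illustrative and not part of the answer above.

```r
library(Matrix)

## Same sample data as above
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

## One character vector of "word:count" tokens per document
tokens <- strsplit(x, "[[:space:]]+")

## Document index for every token, then flatten the tokens
doc_id <- rep(seq_along(tokens), lengths(tokens))
pairs  <- unlist(tokens)

## Split each token at its last colon into term and count
term  <- sub(":[^:]*$", "", pairs)
count <- as.numeric(sub("^.*:", "", pairs))

## Build a sparse term-by-document matrix (rows = terms, columns = documents)
terms <- sort(unique(term))
tdm <- sparseMatrix(i = match(term, terms), j = doc_id, x = count,
                    dims = c(length(terms), length(tokens)),
                    dimnames = list(terms,
                                    paste0("document", seq_along(tokens))))
tdm
```

Only the non-zero counts are stored, so this should scale better than a dense table when most terms are absent from most documents. If a term appears twice on one line, sparseMatrix adds the counts together.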

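See my updated answer, which uses matrices rather than data.frames while building the "out" data.frame.

I had done something like that before; with this incorporated it looks neat! Tested, and it works very well for my data, thank you very much. I especially like the simple and elegant way of creating the data frame and then using the xtabs function to get the matrix; I will keep it in mind for the future.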
        doc1 doc2 doc3 doc4 ...
aword   3    0    0    0
bword   2    4    0    0
cword   15   20   0    0
dword   2    0    0    0
fword   0    1    0    0
...
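
For the ranking step mentioned in the question, one possible way to score documents against a query once such a matrix exists is sketched below in base R. It is an illustration under assumptions rather than the tutorial's method: the idf variant log(N / df), cosine similarity as the score, the sample query string, and the names query, idf, cosine and scores are all chosen here for the example.

```r
## Term-document counts from the sample data (rows = terms, columns = documents)
tdm <- matrix(c(3, 2, 15, 2, 0,
                0, 4, 20, 0, 1),
              ncol = 2,
              dimnames = list(c("aword", "bword", "cword", "dword", "fword"),
                              c("document1", "document2")))

## A hypothetical query, assumed to use the same word:frequency format
query   <- "bword:1 fword:2"
q_pairs <- strsplit(strsplit(query, "[[:space:]]+")[[1]], ":")

## Query vector over the same vocabulary; unknown terms are ignored
q <- setNames(numeric(nrow(tdm)), rownames(tdm))
for (p in q_pairs) {
  if (p[1] %in% names(q)) q[p[1]] <- as.numeric(p[2])
}

## Inverse document frequency: log(number of documents / documents containing the term)
idf <- log(ncol(tdm) / rowSums(tdm > 0))

## Weight counts by idf (the idf vector is recycled down each column)
doc_w <- tdm * idf
q_w   <- q * idf

## Cosine similarity, returning 0 when either vector is all zeros
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

## Score every document against the query and rank
scores <- apply(doc_w, 2, cosine, b = q_w)
sort(scores, decreasing = TRUE)
```

With this idf variant, a term that occurs in every document (bword and cword in the sample) gets weight zero, so only fword contributes to the query here and document2 ranks first.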
