R: Merging text documents by index


I have a data frame that looks like this:

_________________id ________________text______
    1   | 7821             | "some text here"
    2   | 7821             |  "here as well"
    3   | 7821             |  "and here"
    4   | 567              |   "etcetera"
    5   | 567              |    "more text"
    6   | 231              |   "other text"

I would like to group the text by ID so that I can run a clustering algorithm on it:

________________id___________________text______
    1   | 7821             | "some text here here as well and here"
    2   | 567              |   "etcetera more text"
    3   | 231              |   "other text"

Is there a way to do this? I am importing the data from a database table, and there is a lot of it, so I cannot do this by hand.

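You are actually looking for aggregate, not merge, and there should be plenty of examples demonstrating the different aggregation options. The most basic and direct approach is the formula method, which specifies the columns to aggregate.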

Here is your data in a form that can be copied and pasted:

mydata <- structure(list(id = c(7821L, 7821L, 7821L, 567L, 567L, 231L), 
    text = structure(c(6L, 3L, 1L, 2L, 4L, 5L), .Label = c("and here", 
    "etcetera", "here as well", "more text", "other text", "some text here"
    ), class = "factor")), .Names = c("id", "text"), class = "data.frame", 
    row.names = c(NA, -6L))

Of course, there is also data.table, which has a very compact syntax (and is also very fast).


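Thanks, it works great! I will accept your answer as soon as I can (in 4 minutes).

@Arun, totally agree, but when you come across a table like this, here is a trick: copy and paste everything except the first line, and use read.table with sep="|" and strip.white=TRUE.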
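
The read.table trick from the comment could look roughly like the sketch below. The names table_text and parsed are made up for illustration. The quote = "" argument and the gsub() call are an added assumption, not part of the comment: they remove the literal double quotes around each text value explicitly instead of relying on read.table's own quote handling.

```r
# Table rows as pasted from the question, without the header line.
table_text <- '
    1   | 7821             | "some text here"
    2   | 7821             |  "here as well"
    3   | 7821             |  "and here"
    4   | 567              |   "etcetera"
    5   | 567              |    "more text"
    6   | 231              |   "other text"
'

# sep = "|" splits on the pipe; strip.white = TRUE trims the padding
# around each field. quote = "" (an assumption) disables quote parsing.
parsed <- read.table(text = table_text, sep = "|", strip.white = TRUE,
                     quote = "", col.names = c("row", "id", "text"),
                     stringsAsFactors = FALSE)

# Drop the surrounding double quotes from the text column.
parsed$text <- gsub('^"|"$', "", parsed$text)
```

The result has an integer id column and a plain character text column, so it can be passed straight to aggregate.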

aggregate(text ~ id, mydata, paste, collapse = " ")
#     id                                 text
# 1  231                           other text
# 2  567                   etcetera more text
# 3 7821 some text here here as well and here
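
Note that aggregate returns the groups sorted by id, whereas the output asked for in the question keeps the ids in order of first appearance. If that order matters, one possible base R alternative is sketched below. It is not part of the original answer, and it rebuilds the sample data with character text so the block runs on its own; the names groups and merged are illustrative.

```r
# Sample data from the question, with text stored as character.
mydata <- data.frame(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text"),
  stringsAsFactors = FALSE
)

# Split the text by id. Setting the factor levels to unique(id)
# keeps the groups in order of first appearance instead of sorted.
groups <- split(mydata$text, factor(mydata$id, levels = unique(mydata$id)))

# Collapse each group into a single space-separated string.
merged <- data.frame(
  id = as.integer(names(groups)),
  text = unname(vapply(groups, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
```

Here merged lists the ids as 7821, 567, 231, matching the layout shown in the question.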

> library(data.table)
> DT <- data.table(mydata)
> DT[, paste(text, collapse = " "), by = "id"]
     id                                   V1
1: 7821 some text here here as well and here
2:  567                   etcetera more text
3:  231                           other text