R: How do I split/aggregate a large data frame (ffdf) by multiple columns?


ffbase provides the function ffdfdply to split and aggregate rows of data. This answer explains how that basically works, but I still don't see how to split by multiple columns.

My challenge: I need a single split variable that is unique for every combination of the two columns I want to split by. However, with my 4-column data frame (about 500k rows), building that character vector via paste() takes a huge amount of memory.

This is where I'm stuck:

require("ff")
require("ffbase")
load.ffdf(dir="ffdf.shares.02")

# Aggregation by articleID/measure
levels(ffshares$measure) #  "comments", "likes", "shares", "totals", "tw"
splitBy = paste(as.character(ffshares$articleID), ffshares$measure, sep="")

tmp = ffdfdply(ffshares, split=splitBy, FUN=function(x) {
  return(list(
    "articleID" = x[1,"articleID"],
    "measure" = x[1,"measure"],
    # I need vectors for each entry
    "sx" = unlist(x$value), 
    "st" = unlist(x$time)
  ))
}
)
Of course I could use shorter levels for ffshares$measure, or simply numeric codes, but that still would not solve the underlying problem of splitBy becoming huge.
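One possibly lighter-weight alternative (a sketch, not tested on your data): ffbase ships an ikey() helper that computes an integer key identifying each unique combination of the selected columns, which should be far more compact than a pasted character vector. Whether ffdfdply accepts it directly depends on how it coerces its split argument, so treat this as something to try, not a guaranteed fix:

```r
require("ff")
require("ffbase")

## ikey() returns one integer per unique combination of the selected
## columns -- typically much smaller in memory than paste()d strings.
ffshares$splitBy <- ikey(ffshares[c("articleID", "measure")])

## ffdfdply expects a split it can treat as a factor; if the integer
## key is rejected, it may need an explicit conversion first.
```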

Sample data:

    articleID  measure                time value
100        41   shares 2015-01-03 23:20:34     4
101        41       tw 2015-01-03 23:30:30    24
102        41   totals 2015-01-03 23:30:38     6
103        41    likes 2015-01-03 23:30:38     2
104        41 comments 2015-01-03 23:30:38     0
105        41   shares 2015-01-03 23:30:38     4
106        41       tw 2015-01-03 23:40:24    24
107        41   totals 2015-01-03 23:40:35     6
108        41    likes 2015-01-03 23:40:35     2
...
1000       42   shares 2015-01-04 20:10:50     0
1001       42       tw 2015-01-04 21:10:45    24
1002       42   totals 2015-01-04 21:10:35     0
1003       42    likes 2015-01-04 21:10:35     0
1004       42 comments 2015-01-04 21:10:35     0
1005       42   shares 2015-01-04 21:10:35     0
1006       42       tw 2015-01-04 22:10:45    24
1007       42   totals 2015-01-04 22:10:43     0
1008       42    likes 2015-01-04 22:10:43     0
...

Comments on the answer:

- Could you provide sample data? — Sure, it is very simple data; the excerpt above is just a part of it :)
- Hmm, both the paste() and the ffdfdply() call keep R busy for quite a while, probably because of the roughly 400k split levels in my data. Still, your solution works. Thanks!
# Use this, this makes sure your data does not get into RAM completely but only in chunks of 100000 records
ffshares$splitBy <- with(ffshares[c("articleID", "measure")], paste(articleID, measure, sep=""), 
                         by = 100000)
length(levels(ffshares$splitBy)) ## how many levels are in there - don't know from your question

tmp <- ffdfdply(ffshares, split=ffshares$splitBy, FUN=function(x) {
  ## In x you are getting a data.frame in RAM with all records of possibly several articleID/measure combinations
  ## You should write a function which returns a data.frame. E.g. the following returns the mean value by articleID/measure and the first and last timepoint
  x <- data.table::setDT(x)
  xagg <- x[, list(value = mean(value), 
                   first.timepoint = min(time),
                   last.timepoint = max(time)), by = list(articleID, measure)]
  ## the function should return a data frame as indicated in the help of ffdfdply, not a list
  data.table::setDF(xagg)
})
## tmp is an ffdf
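Since tmp is itself an ffdf, the result can be inspected like any other ff data frame (a small usage sketch; as.data.frame() copies everything into RAM, which is usually fine only after the aggregation has shrunk the data):

```r
## Check the size of the aggregated result first.
dim(tmp)

## Materialise in RAM only if it is small enough after aggregation.
result <- as.data.frame(tmp)
head(result)
```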