Improving ggplot2 performance in R

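The ggplot2 package is easily the best plotting system I have ever worked with, except that performance is not very good for larger datasets (~50k points). I want to provide web analyses through Shiny, using ggplot2 as the plotting backend, but I am not happy with the performance, especially compared with base graphics. My question is whether there are any concrete ways to improve this performance.

A starting point is the following code sample: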

library(ggplot2)

n = 86400 # a day in seconds
dat = data.frame(id = 1:n, val = sort(runif(n)))

dev.new()

gg_base = ggplot(dat, aes(x = id, y = val))
gg_point = gg_base + geom_point()
gg_line = gg_base + geom_line()
gg_both = gg_base + geom_point() + geom_line()

benchplot(gg_point)
benchplot(gg_line)
benchplot(gg_both)
system.time(plot(dat))
system.time(plot(dat, type = 'l'))

I get the following timings on my MacPro Retina:

> benchplot(gg_point)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.321    0.078   0.398
3    render     0.271    0.088   0.359
4      draw     2.013    0.018   2.218
5     TOTAL     2.605    0.184   2.975
> benchplot(gg_line)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.330    0.073   0.403
3    render     0.622    0.095   0.717
4      draw     2.078    0.009   2.266
5     TOTAL     3.030    0.177   3.386
> benchplot(gg_both)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.602    0.155   0.757
3    render     0.866    0.186   1.051
4      draw     4.020    0.030   4.238
5     TOTAL     5.488    0.371   6.046
> system.time(plot(dat))
   user  system elapsed 
  1.133   0.004   1.138 
# Note that the timing below depended heavily on whether or not the graphics device
# was in view. Not in view made performance much, much better.
> system.time(plot(dat, type = 'l'))
   user  system elapsed 
  1.230   0.003   1.233

More information about my setup:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0   
 [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1      
 [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2    
[13] scales_0.2.3       stringr_0.6.2

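Hadley gave a cool talk about his new packages at useR! 2013, but he can probably tell you more about that himself.

I'm not sure what your application design looks like, but I often do in-database preprocessing before feeding the data to R. For example, if you are plotting a time series, there is really no need to show every second of the day on the X axis. Instead, you might want to aggregate over one-minute or five-minute intervals and get the min/max/mean.

Below is an example of a function I wrote a few years ago that does something like this in SQL. This particular example uses integer division by a modulus because the times were stored as epoch milliseconds. If the data in SQL are properly stored as date/datetime structures, SQL has more elegant native ways to aggregate by time period.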

#' @param table name of the table
#' @param start start time/date
#' @param end end time/date
#' @param aggregate one of "days", "hours", "mins" or "weeks"
#' @param group grouping variable
#' @param column name of the target column (y axis)
#' @export
minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

  #dates
  start <- round(unclass(as.POSIXct(start))*1000);
  end <- round(unclass(as.POSIXct(end))*1000);

  #must aggregate
  aggregate <- match.arg(aggregate);

  #calculate modulus
  mod <- switch(aggregate,
    "mins"   = 1000*60,
    "hours"  = 1000*60*60,
    "days"   = 1000*60*60*24,
    "weeks"  = 1000*60*60*24*7,
    stop("invalid aggregate value")
  );

  #we need to add the time difference between GMT and PST to make modulo work
  delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));  

  #form query
  query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg, MAX(", column, ") AS ymax, MIN(", column, ") AS ymin, ((CMilliseconds_g +", delta, ") DIV", mod, ") AS timediv FROM", table, "WHERE CMilliseconds_g BETWEEN", start, "AND", end, "GROUP BY", group, ", timediv;")
  mydata <- getquery(query);

  #data
  mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
  mydata$grouping <- as.factor(mydata$grouping)

  #round timestamps
  if(aggregate %in% c("mins", "hours")){
    mydata$time <- round(mydata$time, aggregate)
  } else {
    mydata$time <- as.Date(mydata$time);
  }

  #return
  return(mydata)
}
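
The same binning idea can be tried without a database. The following is only a minimal sketch in plain R, applied to the `dat` data frame from the question; the 60-second bin width and the names `bin_width`, `agg` and `gg_agg` are assumptions made for illustration, not part of the function above. It reduces the 86400 rows to one row per minute and, if ggplot2 is installed, builds a plot showing the min/max band with the mean on top.

```r
# Same data as in the question: one value per second for a whole day.
n <- 86400
dat <- data.frame(id = 1:n, val = sort(runif(n)))

# Bin width in seconds (assumed here; 60 gives one row per minute).
bin_width <- 60

# Integer division assigns each second to a bin, like DIV in the SQL query.
dat$bin <- (dat$id - 1) %/% bin_width

# One row per bin: first id in the bin plus min, mean and max of val.
agg <- data.frame(
  id   = as.vector(tapply(dat$id,  dat$bin, min)),
  ymin = as.vector(tapply(dat$val, dat$bin, min)),
  yavg = as.vector(tapply(dat$val, dat$bin, mean)),
  ymax = as.vector(tapply(dat$val, dat$bin, max))
)

# Build the plot only if ggplot2 is available; nothing is drawn until printed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  gg_agg <- ggplot(agg, aes(x = id)) +
    geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
    geom_line(aes(y = yavg))
  # print(gg_agg) draws it; benchplot(gg_agg) times it like the plots in the question.
}
```

With these settings `agg` has 1440 rows instead of 86400, so the drawing step has far fewer elements to handle, while the ribbon still shows the range of values inside each bin.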

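Is it possible to distribute (individual) plotting over multiple cores, or would caching suit your needs?

Any way of speeding up the plotting is acceptable. Caching is not really a solution, because this question is about the situation where the user really needs to draw a new plot (axis changes, line colors, etc.).

ggplot2 has a built-in timing system, benchplot(), to help figure out why it is so slow.

+1! I agree that aggregation is a good option, and it is definitely worth exploring. However, I am not sure the client (scientists) will be happy with this kind of smoothing done purely for performance.

It's not just about performance. The eye simply cannot read 86400 points on an axis, and your monitor does not have the resolution to display them either. If you want to visualize big(ish) data, you are always going to have to do some aggregation, or your plot will turn into a mess.

I agree, but in this example we are only drawing one line. Imagine ~100k points spread over several facets, with smoothing added. That way you can easily end up with a good plot that still requires drawing a lot of data.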