Performance: Why is this R code so slow?

Tags: performance, r, dataframe


I am trying to create a data frame based on information in another data frame.

The first data frame, base_mar_bop, has data like this:

201301|ABC|4
201302|DEF|12

From this I want to create a data frame with 16 rows:

4 times: 201301|ABC|1
12 times: 201302|DEF|1

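I have written a script that takes a very long time to run. To give you an idea, the final data frame has about 2 million rows and the source data frame has about 10k rows. I can't post the source data because it is confidential.

Because this code took so long to run, I decided to do the job in PHP instead. It finished in under a minute, wrote the result to a txt file, and I then imported that txt file into R.

I have no idea why R takes so long. Is it the function call? Is it the nested for loop? From my point of view there aren't that many computationally intensive steps in there.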

# first create an empty dataframe called base_eop that will hold each subscriber on a row,
# identified by CED, RATEPLAN and 1,
# where 1 is the count and the sum of the 1s should end up equal to the base
base_eop <-base_mar_bop[1,]

# let's give some logical names to the columns in the df
names(base_eop) <- c('CED','RATEPLAN','BASE')


# define the function that enables us to insert a row at the bottom of the dataframe
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}


# now loop through the eop base for march, each row contains the ced, rateplan and number of subs
# we need to insert a row for each individual sub
for (i in 1:nrow(base_mar_eop)) {
  # we go through every row in the dataframe
  for (j in 1:base_mar_eop[i,3]) {
    # we insert a row for each CED, rateplan combination and set the base value to 1
    base_eop <- insertRow(base_eop,c(base_mar_eop[i,1:2],1),nrow(base_eop)) 
  }
}

# since the dataframe was created using the first row of base_mar_bop we need to remove this first row
base_eop <- base_eop[-1,]

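I haven't tried any benchmarking, but a vectorized approach that repeats each row with rep(), which works on your small example, should be much faster.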
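A minimal sketch of that idea on the two-row example from the question. The column names CED, RATEPLAN and BASE follow the question's code; the object names d and expanded are only illustrative.

```r
# Two-row example from the question; d and expanded are illustrative names
d <- data.frame(CED = c(201301, 201302),
                RATEPLAN = c("ABC", "DEF"),
                BASE = c(4, 12))

# rep(x, times) with a vector 'times' repeats x[i] exactly times[i] times,
# so each column of the result is built in a single vectorized call
expanded <- with(d, data.frame(CED = rep(CED, BASE),
                               RATEPLAN = rep(RATEPLAN, BASE),
                               BASE = 1))

nrow(expanded)       # 16
table(expanded$CED)  # 4 rows for 201301, 12 rows for 201302
```

No row is ever appended here, so nothing has to be copied again and again as the result grows.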

A more realistic example, including timing:

d2 <- data.frame(CED=1:10000,RATEPLAN=rep(LETTERS[1:25],
         length.out=10000),BASE=200) 
nrow(d2) ## 10000
sum(d2$BASE)  ## 2e+06
system.time(d3 <- with(d2,
      data.frame(CED=rep(CED,BASE),RATEPLAN=rep(RATEPLAN,BASE),
              BASE=1)))
##   user  system elapsed 
## 0.244   0.860   1.117 
nrow(d3)  ## 2000000 (== 2e+06)
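The question's code comments say that the 1s should sum back to the original base. A quick sanity check along those lines could look like this; it rebuilds d2 and d3 as above so that it runs on its own, and totals is an illustrative name.

```r
# Same synthetic data as in the timing example above
d2 <- data.frame(CED = 1:10000,
                 RATEPLAN = rep(LETTERS[1:25], length.out = 10000),
                 BASE = 200)
d3 <- with(d2, data.frame(CED = rep(CED, BASE),
                          RATEPLAN = rep(RATEPLAN, BASE),
                          BASE = 1))

# Summing the 1s per CED should reproduce the original BASE column
totals <- tapply(d3$BASE, d3$CED, sum)
all(totals == d2$BASE)  # TRUE when the expansion is correct
```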

Here is one way using data.table, although @BenBolker's timings are already excellent:

library(data.table)
DT <- data.table(d2)  ## d2 from @BenBolker's answer
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)]
out
#            CED RATEPLAN BASE
#       1:     1        A    1
#       2:     1        A    1
#       3:     1        A    1
#       4:     1        A    1
#       5:     1        A    1
#      ---                    
# 1999996: 10000        Y    1
# 1999997: 10000        Y    1
# 1999998: 10000        Y    1
# 1999999: 10000        Y    1
# 2000000: 10000        Y    1
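Since the chained call can be hard to read, here is a sketch that splits it into its three steps on a tiny table. The two-row data and the name step2 are illustrative; the data.table calls are the same ones used in the one-liner.

```r
library(data.table)

# Tiny illustrative table with the same columns as the question's data
DT <- data.table(CED = c(201301, 201302),
                 RATEPLAN = c("ABC", "DEF"),
                 BASE = c(4, 12))

# Step 1: add a row-number column by reference; .N is the number of rows
DT[, ID := 1:.N]

# Step 2: the i argument selects rows; rep(ID, BASE) repeats row number k
# BASE[k] times, so this returns a new, expanded table
step2 <- DT[rep(ID, BASE)]

# Step 3: on the expanded table, set BASE to 1 and drop the helper column
out <- step2[, `:=`(BASE = 1, ID = NULL)]

nrow(out)   # 16
names(out)  # "CED" "RATEPLAN" "BASE"
```

Note that := modifies a table in place, so DT itself still carries the ID column after step 1; only the expanded table has it removed.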

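You would be better off defining the whole data frame in advance and then filling it in, rather than appending rows. I think this is discussed in Pat Burns's The R Inferno. Also consider using the data.table package for large operations like this.

Please provide a small reproducible example, something the code can actually be run on.

In your output example, the second line should end in 1 (201302|DEF|1), not 12.

@BenBolker Updated, the line now ends in 1.

By the way, you can easily put together a reproducible example with arbitrary data, e.g. data.frame(CED=1:10000, RATEPLAN=rep(LETTERS[1:25], length.out=10000), BASE=200)

+1, but I'd love to see a data.table solution from someone. I still can't get my head around the syntax.

+1. Amazing speed-up.

@SimonO101, I'm not sure data.table is necessarily faster than Ben's approach. With my answer it isn't, but I may not be using data.table very efficiently either. The syntax is much more concise, though, and I think discussions about speed get a bit silly when the question is whether you wait 100 milliseconds more or less :)

Another approach that seems slightly faster: DT[DT[, rep(.I, BASE)]][, BASE := 1]

+1. My motivation for wanting a data.table solution wasn't speed but understanding how the syntax works. Ben's approach is more intuitive to me, but compound queries seem so powerful, if only I could get the hang of them!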
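The first comment's advice, allocating the full result up front and then filling it, can be sketched as follows. This is only an illustration on the question's two-row example, with hypothetical names (src, ced, rateplan, res); the rep()-based answers are shorter and likely faster still, but the sketch shows how a loop can avoid re-copying a growing data frame on every insert.

```r
# Small illustrative input; src and res are hypothetical names
src <- data.frame(CED = c(201301, 201302),
                  RATEPLAN = c("ABC", "DEF"),
                  BASE = c(4, 12),
                  stringsAsFactors = FALSE)

n <- sum(src$BASE)  # total number of rows needed, known before the loop

# Allocate full-length columns once instead of growing row by row
ced <- numeric(n)
rateplan <- character(n)

pos <- 0
for (i in seq_len(nrow(src))) {
  idx <- pos + seq_len(src$BASE[i])  # block of output rows for source row i
  ced[idx] <- src$CED[i]
  rateplan[idx] <- src$RATEPLAN[i]
  pos <- pos + src$BASE[i]
}

res <- data.frame(CED = ced, RATEPLAN = rateplan, BASE = 1)
nrow(res)  # 16
```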

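A shorter variant of the data.table approach that indexes by row number directly and skips the helper ID column:

out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]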