Function for "smoothing" time data: can this be done more efficiently?

function r datetime performance

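I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, START, END (in that order).

Now I want all rows with the same ID whose time spans overlap (or whose start date is exactly one day after another row's end date) to be merged together.

Merging them means that they end up in a single row with the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).

I have written a function for this (it is not fully tested, but it looks fine so far). The problem is that, because my data frame has nearly 100,000 observations, the function is very slow.

Can you help me make the function more efficient?

Here is the function: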

smoothingEpisodes <- function (theData) {
    theOutput <- data.frame()

    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

    theOutput
}

(START and END are of class "Date"; ID is numeric.)

dput of the data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
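
For reference, here is the result the merging rules above should produce for this sample, traced by hand from the description (same ID, and the next START no later than one day after the running END). The name `expected` is only an illustration and is not part of the original question; the dates are written as day counts since 1970-01-01, exactly as in the dput output.

```r
# Hand-traced expected result for the sample data (an illustration,
# not output copied from the original post).
expected <- data.frame(
    ID    = c(1L, 1L, 2L, 3L, 3L),
    START = as.Date(c(10957, 11201, 10971, 11109, 11139), origin = "1970-01-01"),
    END   = as.Date(c(11169, 11261, 11062, 11123, 11153), origin = "1970-01-01")
)

# Nine input rows collapse into five episodes.
print(expected)
```

The first three rows of ID 1 collapse into one episode because each START falls on or before the day after the running END; the fourth row starts more than a day later and so opens a new episode.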

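The first optimisation I would suggest [without thinking hard about what you are trying to do] is to allocate storage for the output. At the moment you are growing the output at every iteration of the loop. In R that is an absolute no-no! It is something you never do unless you like painfully slow code: R has to copy the object and expand it at each iteration, and that is slow.

Looking at the code, we know that the output needs nrow(theData) - 1 rows and 3 columns, so create that before the loop starts: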
theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
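
To make the cost of growing an object concrete, here is a small timing sketch. The functions `growRows` and `fillRows` and the toy columns are made up for illustration; actual timings depend on the machine and the R version, so no figures are quoted.

```r
n <- 2000L

# Growing: rbind() builds a new, larger data frame on every iteration.
growRows <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(ID = i, VALUE = i * 2))
    }
    out
}

# Preallocating: fill plain vectors of known length, build the data frame once.
fillRows <- function(n) {
    id <- integer(n)
    value <- numeric(n)
    for (i in seq_len(n)) {
        id[i] <- i
        value[i] <- i * 2
    }
    data.frame(ID = id, VALUE = value)
}

tGrow <- system.time(a <- growRows(n))
tFill <- system.time(b <- fillRows(n))

# Elapsed seconds for each approach.
print(c(grow = tGrow[["elapsed"]], fill = tFill[["elapsed"]]))
```

Both functions return the same values; only the way the storage is obtained differs.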

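However, the best tip I can give you is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data: large enough to give you a bit of run time in which to gather useful profiling data, but small enough that you do not have to wait ages for the profiling run to finish. To profile in R, use Rprof(). You can then look at the results using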
summaryRprof("my_fun_profile.Rprof")
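
A minimal sketch of the whole profiling round trip might look like the following. The function `growFrame` is a hypothetical stand-in for the code being profiled, and a temporary file is used instead of "my_fun_profile.Rprof" so the example leaves nothing behind; it assumes the profiled call runs long enough for the sampler to record something.

```r
# Hypothetical slow function, used only to have something to profile.
growFrame <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(ID = i, VALUE = i * 2))
    }
    out
}

profFile <- tempfile(fileext = ".Rprof")

Rprof(profFile)            # start the sampling profiler, writing to profFile
res <- growFrame(3000L)    # run the code to be profiled
Rprof(NULL)                # stop profiling

prof <- summaryRprof(profFile)
head(prof$by.self)         # functions ranked by time spent in their own code
```

The `by.self` table is usually the first place to look: it shows which functions the time is actually spent in, rather than which ones merely call them.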

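Hadley Wickham (@Hadley) has a package that makes this a bit easier. It is called profr. As Dirk reminded me in the comments, there is also Luke Tierney's proftools package.

Edit: Since the OP provided some test data, I quickly knocked something together to show the speed-up you get just by following good loop practice: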
smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
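    ## drop unused slots (still 0 from preallocation; assumes no real ID is 0).
    ## Logical indexing is used because -which(...) would drop every row
    ## if there happened to be no unused slots.
    theOutput <- theOutput[theOutput$ID != 0, ]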
    theOutput
}

That is a 50% speed-up. Not dramatic, but it comes simply from not growing an object at every iteration.

To avoid having to drop the empty rows at the end, I did something slightly different:
smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    theLength <- nrow(theData)

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd

            j <- j + 1

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}
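
One edge case worth checking in the loop versions above: `2:nrow(theData)` counts down when the data frame has a single row, so the loop body would run for a row 2 that does not exist. The sketch below only demonstrates the indexing behaviour; `seq_len(n)[-1L]` is one possible alternative, not something the answers above use.

```r
n <- 1L                     # pretend nrow(theData) is 1

downward <- 2:n             # the vector c(2L, 1L), not an empty sequence
safe     <- seq_len(n)[-1L] # integer(0), so a for loop over it never runs

visited <- integer(0)
for (i in safe) visited <- c(visited, i)

print(downward)
print(length(visited))      # 0: the loop body was skipped
```

An early return for inputs with fewer than two rows achieves the same protection.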

The output of dput() would be more useful, since we need the objects to be Dates.

Thanks for the tips! In fact, the output can have anywhere from 1 to nrow(theData) rows, but you are right about the columns. If I initialise the output with nrow(theData) rows, is there a way to get rid of the empty rows at the end?

@Marcel Yes, the example I just posted does that. In that example, res1 and res2 are equal apart from the row names.

@Dirk - good point, I had forgotten that one. My mention of Hadley's profr is not an endorsement; I have not used it, so I cannot give an informed opinion.

Marcel, I thought I would try to improve your code a bit. The version below is about 30x faster (from 3 seconds to 0.1 seconds). The trick is to first extract the three columns into integer and double vectors.

As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L.
smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    theLength <- nrow(theData)

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd

            j <- j + 1

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

smoothingEpisodes3 <- function (theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    id <- as.integer(theData[["ID"]])
    start <- as.numeric(theData[["START"]])
    end <- as.numeric(theData[["END"]])

    curId <- id[[1L]]
    curStart <- start[[1L]]
    curEnd <- end[[1L]]

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- id[[i]]
        nextStart <- start[[i]]
        nextEnd <- end[[i]]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[[j]] <- curId
            out.2[[j]] <- curStart
            out.3[[j]] <- curEnd

            j <- j + 1L

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[[j]] <- curId
    out.2[[j]] <- curStart
    out.3[[j]] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))

system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
all.equal( a1, a2 )
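
As a further possibility that none of the answers above use, the loop can be replaced by a vectorised pass. This is only a sketch under stated assumptions: the data is sorted by ID and START, there are no NA dates, and END is never earlier than START in any row. The function name `smoothingEpisodesVec` is made up for illustration. The idea is to take a running maximum of END within each ID and start a new episode wherever the ID changes or START is more than one day past that running maximum.

```r
smoothingEpisodesVec <- function(theData) {
    n <- nrow(theData)
    if (n < 2L) return(theData)

    id    <- theData[["ID"]]
    start <- as.numeric(theData[["START"]])
    end   <- as.numeric(theData[["END"]])

    # Running maximum of END within each ID.
    runMax <- ave(end, id, FUN = cummax)

    # A row opens a new episode if the ID changes or there is a gap of
    # more than one day after the latest END seen so far for that ID.
    newEpisode <- c(TRUE, id[-1L] != id[-n] | start[-1L] > runMax[-n] + 1)

    # The last row of each episode carries that episode's maximum END
    # (this relies on END >= START in every row).
    lastOfEpisode <- c(newEpisode[-1L], TRUE)

    data.frame(ID    = id[newEpisode],
               START = as.Date(start[newEpisode], origin = "1970-01-01"),
               END   = as.Date(runMax[lastOfEpisode], origin = "1970-01-01"))
}

# Sample data from the question, rebuilt here so the sketch is self-contained.
x <- data.frame(
    ID    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
    START = as.Date(c(10957, 11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139),
                    origin = "1970-01-01"),
    END   = as.Date(c(11047, 11108, 11169, 11261, 11047, 11031, 11062, 11123, 11153),
                    origin = "1970-01-01")
)

res <- smoothingEpisodesVec(x)
print(res)
```

Before relying on it, compare its output with one of the loop versions on real data using all.equal(), as in the benchmark above, since it has only been checked against the small sample.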