Speeding up a slow algorithm in R

Tags: r, performance, dataframe

I wrote an algorithm in R that combines readings from multiple sensors under a single timestamp.

Most sensors report a value every 500 ms, but some only report changes, so the algorithm has to look up each sensor's last known value at any given time.

The algorithm works, but it is far too slow: once I run it on the real data with 20+ sensors, it takes ages to finish. My assumption is that it is slow because I use data frames, or because of the way I access and move my data.

I have already tried to speed it up by walking through each data frame only once instead of iterating over it for every timestamp, and I preallocate all the space the data needs.

Any suggestions are welcome. I am very new to R, so I don't know which data types are slow and which are fast.

library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)    

# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
  if (typeof(dataframes) == "list"){
    # get the biggest and smallest datetime stamp from every dataframe
    # this will be used to calculate the size of the resulting frame ((largest time - smallest time)/1000 = dataframe rows)
    # this means one value every second

    largest_time <- 0
    smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
    for (i in 1:length(dataframes)){
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)

      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }

    # result dataframe will have ... rows
    result.size <- floor((largest_time - smallest_time)/timeinterval)
    message(sprintf("Result size: %i", result.size)) # sprintf alone would silently discard its result inside a function

    # create a numeric array that contains the indexes of every dataframe, all set to 1
    dataframe_indexes <- numeric(length(dataframes))
    dataframe_indexes[dataframe_indexes == 0] <- 1

    # data vectors for the dataframe
    result.timestamps <- numeric(result.size)
    result <- list(result.timestamps)
    for (i in 2:(length(dataframes)+1)) result[[i]] <- numeric(result.size) # add an empty vector for every datapoint

    # use progressbar
    pb <- txtProgressBar(1, result.size, style = 3)

    # make a for loop to run through every data row of the resulting data frame (creating a row every run through)
    # every run-through, increase the index of each dataframe until its timestamp exceeds the result row's timestamp, then go one index back
    #for (i in 1:200){
    for (i in 1:result.size){
      current_timestamp <- smallest_time + timeinterval*(i-1)
      result[[1]][i] <- current_timestamp

      for (i2 in 1:length(dataframes)){
        while (dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] < current_timestamp && dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] != max(dataframes[[i2]]$TIMESTAMP)){
            dataframe_indexes[i2] <- dataframe_indexes[i2]+1
        }

        if (dataframe_indexes[i2] > 1){
            dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
        }

        result[[i2+1]][i] <- dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
      }

      setTxtProgressBar(pb, i)
    }

    close(pb)

    result.final <- data.frame(result)

    return(result.final)
  } else {
    return(NA)
  }
}
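As an aside, the inner while-loop lookup can usually be vectorized away entirely with base R's `findInterval`, which returns, for each target timestamp, the index of the last reading at or before it. A minimal sketch with made-up sample data (the column names `TIMESTAMP`/`VALUE` match the question; everything else is illustrative):

```r
# findInterval(x, vec) gives, for every element of x, the index of the last
# element of the sorted vector vec that is <= x (0 if none) -- exactly the
# "last known value" lookup, but done in C over the whole vector at once.
sensor <- data.frame(TIMESTAMP = c(1000, 1500, 3000),
                     VALUE     = c(10, 20, 30))

# one target timestamp every 500 ms
targets <- seq(1000, 3500, by = 500)

idx <- findInterval(targets, sensor$TIMESTAMP)  # 1 2 2 2 3 3
idx[idx == 0] <- 1                              # clamp targets earlier than the first reading
locf_values <- sensor$VALUE[idx]                # 10 20 20 20 30 30
```

Running `findInterval` once per sensor replaces the per-timestamp while loop, so each data frame really is traversed only once.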

I fixed it today by changing every data frame into a matrix. The code now runs in 9.5 seconds instead of 70 minutes.

Conclusion: data frames are terrible for performance.
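Most of that gap comes from repeating `df$col[i]` inside a tight loop: each `$` extraction on a data frame goes through S3 dispatch, while matrix indexing is a single C-level lookup into one contiguous numeric block. A rough illustration (timings vary by machine, so no numbers are claimed):

```r
# compare per-element access cost: data.frame column vs. plain matrix
n  <- 1e5
df <- data.frame(x = runif(n))
m  <- matrix(df$x, ncol = 1)

t_df <- system.time(for (i in 1:n) df$x[i])["elapsed"]  # re-extracts the column on every pass
t_m  <- system.time(for (i in 1:n) m[i, 1])["elapsed"]  # direct indexed read

c(dataframe = t_df, matrix = t_m)  # the data.frame loop is typically far slower
```

Hoisting the column out of the loop once (`x <- df$x`) removes most of the difference, which is essentially what the matrix rewrite does wholesale.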

library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
library(compiler)    

# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
  time.start <- Sys.time()
  if (typeof(dataframes) == "list"){
    # store the sizes of all the dataframes
    resources.largest_size <- 0
    resources.sizes <- numeric(length(dataframes))

    for (i in 1:length(dataframes)){
      resources.sizes[i] <- length(dataframes[[i]]$VALUE)
      if (resources.sizes[i] > resources.largest_size) resources.largest_size <- resources.sizes[i]
    }

    # generate a matrix that can hold all needed dataframe values
    resources <- matrix(nrow = resources.largest_size, ncol = length(dataframes)*2)
    for (i in 1:length(dataframes)){
      j <- i*2
      resources[1:resources.sizes[i],j-1] <- dataframes[[i]]$TIMESTAMP
      resources[1:resources.sizes[i],j] <- dataframes[[i]]$VALUE
    }

    # get the biggest and smallest datetime stamp from every dataframe
    # this will be used to calculate the size of the resulting frame ((largest time - smallest time)/1000 = dataframe rows)
    # this means one value every second
    largest_time <- 0
    smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
    for (i in 1:length(dataframes)){
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)

      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }

    # result dataframe will have ... rows
    result.size <- floor((largest_time - smallest_time)/timeinterval)
    message(sprintf("Result size: %i", result.size)) # sprintf alone would silently discard its result inside a function

    # create a numeric array that contains the indexes of every dataframe, all set to 1
    dataframe_indexes <- numeric(length(dataframes))
    dataframe_indexes[dataframe_indexes == 0] <- 1

    # data matrix for the result
    result <- matrix(data = 0, nrow = result.size, ncol = length(dataframes)+1)

    # use progressbar
    pb <- txtProgressBar(1, result.size, style = 3)

    # make a for loop to run through every data row of the resulting data frame (creating a row every run through)
    # every run-through, increase the index of each dataframe until its timestamp exceeds the result row's timestamp, then go one index back
    #for (i in 1:200){
    for (i in 1:result.size){
      current_timestamp <- smallest_time + timeinterval*(i-1)
      result[i,1] <- current_timestamp

      for (i2 in 1:length(dataframes)){
        j <- i2*2
        # stop advancing at the last valid row: compare the index, not the timestamp, against the row count
        while (resources[dataframe_indexes[i2],j-1] < current_timestamp && dataframe_indexes[i2] != resources.sizes[i2]){
          dataframe_indexes[i2] <- dataframe_indexes[i2]+1
        }

        # at the moment the last value of the array is never selected, needs to be fixed
        if (dataframe_indexes[i2] > 1){
          dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
        }

        result[i,i2+1] <- resources[dataframe_indexes[i2], j] #dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
      }

      setTxtProgressBar(pb, i)
    }

    close(pb)

    result.final <- data.frame(result)

    time.end <- Sys.time()
    print(time.end-time.start)

    return(result.final)
  } else {
    return(NA)
  }
}
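Since the comments ask for runnable sample data, here is a hypothetical generator (names, ranges, and step sizes are made up) that produces frames in the shape `generalise_data` expects, i.e. `TIMESTAMP` in epoch milliseconds plus a numeric `VALUE`:

```r
# build one fake sensor frame: n readings starting at start_ms, step_ms apart
make_sensor_frame <- function(n, start_ms, step_ms = 500) {
  data.frame(TIMESTAMP = start_ms + step_ms * (0:(n - 1)),
             VALUE     = round(runif(n, 0, 100), 1))
}

set.seed(42)
frames <- list(make_sensor_frame(20, 1e12),                         # regular 500 ms sensor
               make_sensor_frame(5,  1e12 + 300, step_ms = 2000))   # sparse, change-only sensor
```

A list like `frames` can then be passed straight to `generalise_data(frames, 500)`.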

If the code works, this may belong on Code Review. Without digging deeply into your code: it has a lot of loops. Finding ways to vectorize the code, and/or using something like `sapply` to eliminate some of the loops, might help. The book *Efficient R Programming* by Gillespie and Lovelace is worth a look.

I chose my loops very carefully. The only one that takes more than 50 ms is `for (i in 1:result.size)`, but it has another for loop inside it, which probably makes it a bit slower… Thanks for the suggestion to post it on Code Review.

With for loops nested inside for loops you have deeply nested iteration, and that is probably the problem. There are profiling tools available to pinpoint exactly where the time goes.

@Milan please provide some data so that we can run your code.

@F.Privé when I get home from work I will try to write a script that generates some data to run the function on.
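On the profiling suggestion: base R ships a sampling profiler, `Rprof`, that records which functions the time is spent in while the code runs. A minimal sketch with a stand-in loop-heavy workload (`slow_fn` is a placeholder for a call like `generalise_data(frames, 500)`):

```r
# deliberately loop-heavy stand-in workload to profile
slow_fn <- function(n) {
  total <- 0
  for (i in 1:n) total <- total + sqrt(i)
  total
}

prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)  # start sampling the call stack
res <- slow_fn(5e6)
Rprof(NULL)                        # stop profiling

head(summaryRprof(prof_file)$by.self)  # functions ranked by time spent in themselves
```

The `by.self` table points directly at the hot spot, which is usually more reliable than timing loops by hand.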