Scaling variables over a moving date window in R: the script works but is unacceptably slow. Ways to optimize?


I have a data frame in which each row holds the data for one category on one date:

set.seed(1)
k <- 10
df <- data.frame(
    name = c(rep('a',k), rep('b',k)), 
    date = rep(seq(as.Date('2017-01-01'),as.Date('2017-01-01')+k-1, 'days'),2),
    x = runif(2*k,1,20),
    y = runif(2*k,100,300)
    )
View(df)

Structure:

str(df)
'data.frame':   20 obs. of  4 variables:
 $ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
 $ date: Date, format: "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" ...
 $ x   : num  6.04 8.07 11.88 18.26 4.83 ...
 $ y   : num  287 142 230 125 153 ...

I need to scale the x and y variables of this data over a specific date window. The script I came up with is the following:

library(dplyr)
library(lubridate)
df2 <- df
moving_window_days <- 4

##Iterate over each row in df
for(i in 1:nrow(df)){ 
    df2[i,] <- df %>% 
        ##Give me only rows for 'name' on the current row 
        ##which are within the date window of interest
        filter(date <= date(df[i,"date"]) & 
               date >= date(df[i,"date"]) - moving_window_days & 
               name == df[i,"name"]
               ) %>% 
        ##Now scale x and y on this date window
        mutate(x = percent_rank(x), 
               y = percent_rank(y)
        ) %>% 
        ##Get rid of the rest of the rows - leave only the row we are looking at
        filter(date == date(df[i,"date"])) 
}


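It works as intended (I originally wanted the percentile of each observation within the moving window, but the scaled values work just fine). The problem is that the real dataset is much bigger: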

- The `'name'` column has 30 local branch offices
- `'date'` covers at least one year of data for each branch
- I have 6 variables instead of `'x'` and `'y'`
- The moving window is 90 days

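I ran this script on the real data, and out of 30,000 rows it only got through 5,000 in 4 hours... This is the first time I have run into a problem like this.
I am sure my script is very inefficient (I am sure because I am no expert in R; I just assume there is always a better way).
Is there any way to optimize or improve this script?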

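- Is there a way to "purrrify" it (using some of the `map` functions from `purrr`)?
- Nested data frames with `nest()`? I think this could be a solution... I am not sure how to implement it.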

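- Is there anything I can do to tackle the problem in a different way?

One thing you can do is parallel processing. I use the `future` package for this. This may annoy some people, who might consider it a hack, since the future package is intended... well... for futures (or "promises", if you are a front-end developer). This approach is finicky, but it works well.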

library(dplyr)
library(lubridate) # needed for date(); without it the code errors
library(future)

# Create a function that iterates over each row in the df:
my_function <- function(df, x) {
    x <- df
    for(i in 1:nrow(df)){
        x[i, ] <- df %>%
            ##Give me only rows for 'name' on the current row
            ##which are within the date window of interest
            filter(date <= date(df[i,"date"]) &
                   date >= date(df[i,"date"]) - moving_window_days &
                   name == df[i,"name"]
                   ) %>%
            ##Now scale x and y on this date window
            mutate(x = percent_rank(x),
                   y = percent_rank(y)
            ) %>%
            ##Get rid of the rest of the rows - leave only the row we are looking at
            filter(date == date(df[i,"date"]))
    }
    return(x)
}

# The original used plan(multiprocess), which is deprecated in recent versions of future.
plan(multisession) # make sure to always include this in a run of the code.

# Divide df evenly into three separate dataframes:
df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)

# See if your computer is able to split df into 4 or 5 separate dataframes.

# Now bind the dataframes together, but get the indexing right:
rbind(df1,
      df2[(nrow(df2) - 6):nrow(df2), ],
      df3[(nrow(df3) - 5):nrow(df3), ])
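A variation on the idea above, offered only as a sketch: because the loop never looks across different `name` values, splitting by `name` instead of by hand-computed row ranges avoids the overlap arithmetic entirely. The example below uses base R's `parallel` package in place of `future` (a swap made here purely to avoid extra dependencies), and `scale_group` is a hypothetical helper, not something from the original answer. It assumes the same percent-rank formula as `dplyr::percent_rank`, i.e. (min rank - 1) / (n - 1).

```r
library(parallel)

# Sample data from the question
set.seed(1)
k <- 10
df <- data.frame(
    name = c(rep("a", k), rep("b", k)),
    date = rep(seq(as.Date("2017-01-01"), by = "day", length.out = k), 2),
    x = runif(2 * k, 1, 20),
    y = runif(2 * k, 100, 300)
)
moving_window_days <- 4

# Hypothetical helper: scale the chosen columns of ONE group's rows.
# For each row, the value is ranked against the rows of the same group
# dated between (date - window_days) and date, inclusive.
scale_group <- function(d, window_days, vars = c("x", "y")) {
    out <- d
    for (v in vars) {
        out[[v]] <- vapply(seq_len(nrow(d)), function(i) {
            in_window <- d$date <= d$date[i] & d$date >= d$date[i] - window_days
            w <- d[[v]][in_window]
            # (min rank - 1) / (n - 1); a single-row window gives 0/0 = NaN
            sum(w < d[[v]][i]) / (length(w) - 1)
        }, numeric(1))
    }
    out
}

# mclapply forks, which is not available on Windows, so fall back to 1 core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per name; no overlapping rows are needed between tasks
pieces <- mclapply(split(df, df$name), scale_group,
                   window_days = moving_window_days, mc.cores = n_cores)

result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

With 30 branches this gives 30 independent tasks, so the number of cores can be raised without touching any indices.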

@OP, you should be cautious with the answers provided.

--My original answer--

library(tidyverse)

I first split the data frame into a list of data frames grouped by name:

split.df <- split(df, df$name)

This produces a list. To convert it back into a single data frame:

final <- Reduce("rbind",new)

Let's make sure my results agree with yours:

identical(final$x, OP.output$x)
[1] TRUE

--End of my original answer--

--Comparing the solutions--

--@Brian's answer--

@Brian's answer does not give the results you expect. You said your function works as intended, so let's compare Brian's results with yours. The first output shows Brian's results. The second shows yours.

  name       date         x        y        x2        y2
1    a 2017-01-01  6.044665 286.9410 0.0000000 1.0000000
2    a 2017-01-02  8.070354 142.4285 0.0000000 1.0000000
3    a 2017-01-03 11.884214 230.3348 0.3333333 0.3333333
4    a 2017-01-04 18.255948 125.1110 0.3333333 1.0000000

  name       date         x    y
1    a 2017-01-01 0.0000000 0.00
2    a 2017-01-02 1.0000000 0.00
3    a 2017-01-03 1.0000000 0.50
4    a 2017-01-04 1.0000000 0.00

identical(Brian.output$x2, OP.output$x)
[1] FALSE

--END @Brian's answer--

--@Odysseus's answer--

@Odysseus's answer returns the correct results because it uses the same function, but you have to split the data frame manually. See the code below that calls my_function:

df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)

zoo::rollapply can be very fast.

df2 <- df %>%
    group_by(name) %>%
    mutate(x2 = zoo::rollapply(x, width = 4, FUN = percent_rank, fill = "extend")[,1],
           y2 = zoo::rollapply(y, width = 4, FUN = percent_rank, fill = "extend")[,1])

You could of course use purrr::map2 to iterate over the 6 variables (instead of calling rollapply 6 times in mutate), but I am not sure it would speed things up much.

This seems more on-topic on another site.

So you want to compute the percentile of each observation, based on the current and the previous four periods?

@Odysseus210 Maybe you are right, but R does not get much attention there... I know a lot of R questions get answered here.

Percentile, or scaling based on the previous X days. The example code uses 4 days; the real data uses 90 days.

Are you running on Mac or Windows? Or something else?

You have to load the lubridate package to get past the error.

Thanks. It worked.
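For anyone comparing outputs as in the discussion above, it may help to restate what the loop in the question computes: for each row, the percent rank of that row's value among the rows of the same `name` dated from `moving_window_days` days earlier up to the row's own date. Below is a minimal vectorized sketch of that definition in base R, with no per-row filtering of the data frame. `window_percent_rank` is a hypothetical helper, and the formula (min rank - 1) / (n - 1) is assumed to match `dplyr::percent_rank`. It builds an n-by-n matrix per group, which should be fine for about a year of daily rows per branch but is an assumption worth checking on much longer series.

```r
# Sample data from the question
set.seed(1)
k <- 10
df <- data.frame(
    name = c(rep("a", k), rep("b", k)),
    date = rep(seq(as.Date("2017-01-01"), by = "day", length.out = k), 2),
    x = runif(2 * k, 1, 20),
    y = runif(2 * k, 100, 300)
)
moving_window_days <- 4

# Hypothetical helper: percent rank of each value within its trailing date window.
# `values` and `dates` must belong to a single group (one name).
window_percent_rank <- function(values, dates, window_days) {
    days <- as.numeric(dates)
    # age[i, j] = number of days by which row j precedes row i
    age <- outer(days, days, "-")
    # Row j is in row i's window if it is 0 to window_days days older
    in_window <- age >= 0 & age <= window_days
    # smaller[i, j] is TRUE when values[j] < values[i]
    smaller <- outer(values, values, ">")
    # (min rank - 1) / (n - 1); a single-row window gives 0/0 = NaN
    rowSums(in_window & smaller) / (rowSums(in_window) - 1)
}

# Apply per name, writing the scaled values back into a copy of df
scaled <- df
for (rows in split(seq_len(nrow(df)), df$name)) {
    for (v in c("x", "y")) {
        scaled[[v]][rows] <- window_percent_rank(df[[v]][rows], df$date[rows],
                                                 moving_window_days)
    }
}
```

Because the window is defined by dates rather than by a fixed number of rows, this also copes with missing days. Adding the other variables only means extending the `c("x", "y")` vector.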