R: Create (mutate) a new column based on past observations


I have a fairly large dataset with about 500 observations and 3 variables. The first column is time.

For testing I am using this dataset:

dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
        1,1.8,3.5,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
        2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")


   Time Var1 Var2
1     1  1.0  2.0
2     2  1.8  4.8
3     3  3.5  6.5
4     4  3.8  8.8
5     5  5.6 10.6
6     6  6.2 12.2
7     7  7.8 14.8
8     8  8.2 16.2
9     9  9.8 18.8
10   10 10.1 20.1

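What I need to do is create a new column in which each observation is the slope with respect to time over some number of past points. For example, using the past 3 points, it would look something like: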

slopeVar1[i]=slope(Var1[(i-2):i],Time[(i-2):i]) #Not real code
slopeVar2[i]=slope(Var2[(i-2):i],Time[(i-2):i]) #Not real code

    Time    Var1    Var2    slopeVar1   slopeVar2
1   1       1       2       NA          NA
2   2       1.8     4.8     NA          NA
3   3       3.5     6.5     1.25        2.25
4   4       3.8     8.8     1.00        2.00
5   5       5.6     10.6    1.05        2.05
6   6       6.2     12.2    1.20        1.70
7   7       7.8     14.8    1.10        2.10
8   8       8.2     16.2    1.00        2.00
9   9       9.8     18.8    1.00        2.00
10  10      10.1    20.1    0.95        1.95

I have actually done this with a for() loop, but for very large datasets (>100000 rows) it starts to take too long.

The for() loop I am using looks like this:

# CREATE DATA FRAME
dat = as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
              1,1.8,3.5,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
              2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat) = c("Time","Var1","Var2")
dat
plot(dat)

# CALCULATE SLOPE OVER n POINTS, FROM ROW i-n+1 TO ROW i.
# In this case I am taking just 3 points, but it should
# be possible to change the number of points taken.

n = 3                         # number of points used for each slope
l = nrow(dat)                 # number of iterations
slopeVar1 = rep(NA_real_, l)  # rows without enough previous observations stay NA
slopeVar2 = rep(NA_real_, l)
for (i in 1:l) {
    if (i >= n) {
      idx = (i-n+1):i         # rows in the current window
      x  = dat$Time[idx]      # x data for the slopes of Var1 & Var2
      y1 = dat$Var1[idx]      # y data for the slope of Var1
      y2 = dat$Var2[idx]      # y data for the slope of Var2
      slopeVar1[i] = coef(lm(y1~x))[2]  # second coefficient is the slope
      slopeVar2[i] = coef(lm(y2~x))[2]
    }
}
slopeVar1 # Checking results.
slopeVar2

(result = cbind(dat, slopeVar1, slopeVar2)) # Binds original data with new calculated slopes.


This code does output what I want; however, it is very inefficient for really large datasets.
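One way to avoid the per-row lm() calls is to use the closed-form slope of a simple linear regression together with rolling sums. The following is a minimal base-R sketch, not the original poster's method: the helper names `rolling_slope` and `roll_sum` are made up for illustration, and the rolling sums are built with `cumsum()`, which may lose some precision on very long series with large values.

```r
# Rolling OLS slope of y on x over the last n rows, without calling lm().
# For each window: slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),
# where Sx, Sy, Sxy, Sxx are sums over the window.
# rolling_slope and roll_sum are hypothetical helper names.
rolling_slope <- function(x, y, n) {
  len <- length(x)
  # Trailing sum of the last n values; the first n-1 entries are NA.
  roll_sum <- function(v) {
    cs <- cumsum(v)
    out <- rep(NA_real_, len)
    if (len >= n) {
      out[n:len] <- cs[n:len] - c(0, cs)[seq_len(len - n + 1)]
    }
    out
  }
  sx  <- roll_sum(x)
  sy  <- roll_sum(y)
  sxy <- roll_sum(x * y)
  sxx <- roll_sum(x * x)
  (n * sxy - sx * sy) / (n * sxx - sx^2)
}

dat <- data.frame(
  Time = 1:10,
  Var1 = c(1, 1.8, 3.5, 3.8, 5.6, 6.2, 7.8, 8.2, 9.8, 10.1),
  Var2 = c(2, 4.8, 6.5, 8.8, 10.6, 12.2, 14.8, 16.2, 18.8, 20.1)
)

n <- 3  # window size; can be changed
dat$slopeVar1 <- rolling_slope(dat$Time, dat$Var1, n)
dat$slopeVar2 <- rolling_slope(dat$Time, dat$Var2, n)
dat
```

Because everything is vectorized, the cost grows roughly linearly with the number of rows and does not depend on fitting a model per window.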

This quick rollapply implementation seems to speed things up somewhat:

library("zoo")
slope_func = function(period) {
  x  = period[,1] # x data for calculating slope of Var1 & Var2
  y1 = period[,2] # y data for calculating slope of Var1
  y2 = period[,3] # y data for calculating slope of Var2
  z1 = lm(y1~x)   # Fit for Var1 in this window
  z2 = lm(y2~x)   # Fit for Var2 in this window
  # Return both slopes (second coefficient of each fit) for this window
  c(slopeVar1 = coef(z1)[[2]], slopeVar2 = coef(z2)[[2]])
}

start = Sys.time()
rollapply(dat[1:3], FUN=slope_func, width=3, by.column=FALSE)
end=Sys.time()
print(end-start)

Time difference of 0.04980111 secs
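Note that rollapply centers its windows by default and does not pad the result, so the output has fewer rows than the data. To reproduce the layout asked for in the question (NA in the first n-1 rows, each slope aligned with the last row of its window), something like the following sketch might work. It assumes the zoo package is installed, uses `align = "right"` and `fill = NA`, and swaps lm() for the equivalent `cov(x, y) / var(x)` formula; `window_slopes` is a hypothetical helper name.

```r
library(zoo)

dat <- data.frame(
  Time = 1:10,
  Var1 = c(1, 1.8, 3.5, 3.8, 5.6, 6.2, 7.8, 8.2, 9.8, 10.1),
  Var2 = c(2, 4.8, 6.5, 8.8, 10.6, 12.2, 14.8, 16.2, 18.8, 20.1)
)

# Slope of y on x as cov(x, y) / var(x); for a simple regression this
# equals the slope coefficient returned by lm(y ~ x).
# window_slopes is a hypothetical helper; w is one window (n rows, 3 columns).
window_slopes <- function(w) {
  x <- w[, 1]
  c(cov(x, w[, 2]) / var(x),
    cov(x, w[, 3]) / var(x))
}

n <- 3
slopes <- rollapply(as.matrix(dat), width = n, FUN = window_slopes,
                    by.column = FALSE, align = "right", fill = NA)
colnames(slopes) <- c("slopeVar1", "slopeVar2")

result <- cbind(dat, slopes)
result
```

Changing `n` changes the number of past points used for each slope.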

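OP's previous implementation took a time difference of 0.2666121 secs for the same computation.

What exactly do you want to calculate when n=3 in "…the slope with respect to time over some past points…"? For example, how do you get/derive the 1.25 and 2.25 in the third row?

You can profile your code and identify the bottleneck. I think it comes from fitting a large number of lm models. Also google: "r parallel loop".

OK, for example: slopeVar1[3]=1.25 comes from the linear regression of Var1[1:3]=[1,1.8,3.5] on Time[1:3]=[1,2,3]. The slope for Var2 is obtained the same way.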
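To make the derivation of the third row concrete, here is a small check in base R. It computes the slope once with lm() and once with the equivalent formula cov(x, y) / var(x); the variable names are chosen only for this illustration.

```r
# First three observations from the test dataset
time <- c(1, 2, 3)
var1 <- c(1, 1.8, 3.5)
var2 <- c(2, 4.8, 6.5)

# Slope from a fitted linear model (second coefficient)
slope1_lm <- unname(coef(lm(var1 ~ time))[2])
slope2_lm <- unname(coef(lm(var2 ~ time))[2])

# Same slope from the closed-form expression cov(x, y) / var(x)
slope1_cov <- cov(time, var1) / var(time)
slope2_cov <- cov(time, var2) / var(time)

c(slope1_lm, slope1_cov)  # both 1.25
c(slope2_lm, slope2_cov)  # both 2.25
```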