Extracting coefficients from a large dataset in R with the lme4 package

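I am using the lme4 package to fit a functional linear mixed-effects model. The procedure consists of building a spline basis and then using the basis functions as variables in the model.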
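
As a minimal sketch of that step, the basis can be built with `splines::bs()` and turned into data-frame columns. The time points and the names `time_toy`, `basis_toy` and `basis_df` are made up for illustration and are not part of my real data:

```r
library(splines)

# Hypothetical toy time points, for illustration only
time_toy <- seq(2, 28, length.out = 20)

# Quadratic B-spline basis with 3 columns and no interior knots
basis_toy <- bs(time_toy, df = 3, degree = 2, intercept = TRUE)

# data.frame() renames the basis columns "1", "2", "3" to X1, X2, X3,
# so they can be used as ordinary variables in a model formula
basis_df <- data.frame(basis_toy)
colnames(basis_df)
```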

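The dataset I am using consists of 3,654,090 observations, and my model is the following:

Parameter ~ 0 + X1 + X2 + X3 + ZONE + (0 + X1 | ANNI) + (0 + X2 | ANNI) + (0 + X3 | ANNI) + (0 + X1 | ID) + (0 + X2 | ID) + (0 + X3 | ID)

where X1, X2, X3 are the 3 spline basis functions, and ZONE, ANNI, ID are categorical variables. ZONE and ANNI have 5 levels each, and ID has 89,942 levels. The model converges, but when I try to extract the coefficients I get the following error message: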

Error in .local(a, b, ...) : Cholmod error 'problem too large' at file ../Core/cholmod_memory.c, line 334

I know the problem lies in the random-effects matrix (269,841 x 3,654,090).
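
Those dimensions can be checked with a little arithmetic. The sketch below assumes that each of the six random-effect terms contributes one column per level of its grouping factor, and that a dense copy would use 8 bytes per entry; the variable names are made up for illustration:

```r
n_obs   <- 3654090  # observations
n_id    <- 89942    # levels of ID
n_anni  <- 5        # levels of ANNI
n_terms <- 3        # X1, X2, X3 each get a random slope per grouping factor

# one random-effect column per (term, level) pair
n_ranef <- n_terms * (n_id + n_anni)
n_ranef
# 269841

# size of a dense n_ranef x n_obs matrix of doubles, in terabytes
dense_tb <- n_ranef * n_obs * 8 / 1e12
round(dense_tb, 2)
# 7.89

# the entry count also exceeds what a 32-bit integer index can address
n_ranef * n_obs > .Machine$integer.max
# TRUE
```

Stored sparsely, the same matrix has at most 6 nonzero entries per observation (3 for ANNI and 3 for ID), so only a dense version would be unmanageable. Whether the error is actually triggered by such a dense intermediate is an assumption on my part.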

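My questions are:

Is there a way to extract the coefficients from a very large dataset?

If not, what is the maximum matrix dimension I can use to avoid this problem?

I am using R 3.6.3 and RStudio 1.2.5033 on a Windows computer with 8 GB of RAM. I also tried running my code on a Linux virtual machine with 64 GB of RAM, but that did not solve the problem.

I have also built an example for you: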

library(lme4)
library(splines)
library(tidyverse)

# Windows only: raise the memory limit (size in MB)
memory.limit(100000)

parabola<-function(a,b,c,x){
  a*x^2+b*x+c
}

#time interval
x <- seq(2,28,length.out = 100)

# sample only 5 time points for every ID (15000 IDs)
t<-matrix(0,nrow = 5,ncol = 15000)
set.seed(123)
for(i in 1:15000){
t[,i]=sample(x,5)
}

#creation of first case
y1<-matrix(0,nrow = 5,ncol = 15000)
set.seed(123)
for(i in 1:8000){
  a<-sample(seq(-0.01,0.01,length.out = 50),1)
  b<-sample(seq(-0.50,0.50,length.out = 50),1)
  c<-sample(seq(0,1,length.out=50),1)
  for(j in 1:5){
    y1[j,i]=parabola(a,b,c,t[j,i])
  }
}

#creation of second case
y2<-matrix(0,nrow = 5,ncol = 15000)
set.seed(123)
for(i in 1:8000){
  a<-sample(seq(-0.01,0.01,length.out = 50),1)
  b<-sample(seq(-0.50,0.50,length.out = 50),1)
  c<-sample(seq(1.5,2.5,length.out=50),1)
  for(j in 1:5){
    y2[j,i]=parabola(a,b,c,t[j,i])
  }
}

#creation of third case
y3<-matrix(0,nrow = 5,ncol = 15000)
set.seed(123)
for(i in 1:8000){
  a<-sample(seq(-0.01,0.01,length.out = 50),1)
  b<-sample(seq(-0.50,0.50,length.out = 50),1)
  c<-sample(seq(3,4,length.out=50),1)
  for(j in 1:5){
    y3[j,i]=parabola(a,b,c,t[j,i])
  }
}

#creation of fourth case
y4<-matrix(0,nrow = 5,ncol = 15000)
set.seed(123)
for(i in 1:8000){
  a<-sample(seq(-0.01,0.01,length.out = 50),1)
  b<-sample(seq(-0.50,0.50,length.out = 50),1)
  c<-sample(seq(3.5,4.5,length.out=50),1)
  for(j in 1:5){
    y4[j,i]=parabola(a,b,c,t[j,i])
  }
}

# creation of the dataset
t<- t %>% as.data.frame() %>%  gather(value="time")
y1 <- y1 %>%
  as.data.frame() %>%
  gather(value="y") %>%
  add_column(case=rep('A',75000))

y2 <- y2 %>%
  as.data.frame() %>%
  gather(value="y") %>% 
  add_column(case=rep('B',75000))

y3 <- y3 %>%
  as.data.frame() %>%
  gather(value="y") %>% 
  add_column(case=rep('C',75000))

y4 <- y4 %>%
  as.data.frame() %>%
  gather(value="y") %>% 
  add_column(case=rep('D',75000))


# final dataset
X<-rbind(cbind(t,y1)[,-3],cbind(t,y2)[,-3],cbind(t,y3)[,-3],cbind(t,y4)[,-3])

# model
k <- 3 # number of basis functions (df)
d <- 2 # degree of the spline

mean_t <- X$time # times
sort_t_list <- sort(mean_t, index.return = TRUE)
sort_t <- sort_t_list[[1]] # ordered times
sort_t_ind <- sort_t_list[[2]] # indices for permuting the other variables accordingly

# B-spline basis with k columns of degree d; data.frame() names the columns X1, X2, X3
spline <- bs(sort_t, df = k, degree = d, intercept = TRUE)

# data frame for the regression model, with response, case and ID permuted to match the sorted times
data <- data.frame(spline, Parameter = X$y[sort_t_ind], case = as.factor(X$case[sort_t_ind]), ID = X$key[sort_t_ind])

set.seed(1993)
fit_k3d2 <- lmer(Parameter~0+X1+X2+X3+(0+X1|case)+(0+X2|case)+(0+X3|case)+(0+X1|ID)+(0+X2|ID)+(0+X3|ID),
                 data=data,
                 control=lmerControl(optCtrl=list(xtol_abs=1e-10, ftol_abs=1e-10)),
                 na.action=na.exclude,
                 verbose = TRUE)

# this command gives the error
coef_id<-coef(fit_k3d2)$ID
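
One thing that might be worth trying, on the assumption that the failure comes from computing the conditional variances of the random effects rather than the conditional modes themselves, is to call `fixef()` and `ranef(..., condVar = FALSE)` directly and add them by hand. The sketch below uses lme4's built-in `sleepstudy` data as a small stand-in; `fit_small` and `manual_coef` are illustrative names, and I have not verified that this avoids the error on the full model:

```r
library(lme4)

# Small stand-in model on lme4's built-in sleepstudy data
# (an assumption for illustration; the real model is far larger)
fit_small <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

fe <- fixef(fit_small)                            # named vector of fixed effects
re <- ranef(fit_small, condVar = FALSE)$Subject   # conditional modes only

# per-subject coefficient = fixed effect + that subject's random deviation
manual_coef <- re
for (nm in intersect(names(fe), colnames(re))) {
  manual_coef[[nm]] <- re[[nm]] + fe[[nm]]
}
head(manual_coef)
```

For the example above, the same pattern would use `fit_k3d2` and `$ID` in place of `fit_small` and `$Subject`. Fixed effects with no matching random-effect column (such as ZONE in the full model) are not added by this loop and would have to be read from `fixef()` separately.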
