Parallelizing a GLM loop in R

r parallel-processing


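I am trying to write a parallel for loop that finds the best GLM efficiently, adding only the variables with the lowest p-values, to decide whether or not I am going to play tennis (a binary yes/no).

For example, I have a table (a data frame) containing a weather dataset. I build the GLM by checking which of the following single-variable models has the lowest p-value: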

PlayTennis ~ Precip
PlayTennis ~ Temp
PlayTennis ~ Relative_Humidity
PlayTennis ~ WindSpeed

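Say that

PlayTennis ~ Precip

has the lowest p-value. The next iteration of the repeat loop then checks which of the other variables has the lowest p-value when added to that model: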

PlayTennis ~ Precip + Temp
PlayTennis ~ Precip + Relative_Humidity 
PlayTennis ~ Precip + WindSpeed

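This continues until no significant variable remains (that is, the lowest remaining p-value is greater than 0.05). We might therefore end up with a final model of PlayTennis ~ Precip + WindSpeed (this is all hypothetical).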

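Are there any suggestions on how to parallelize this code across several cores? I came across speedglm, a faster replacement for glm in the speedglm library. It does improve things, but not by much. I also looked into foreach loops, but I am not sure how the loop communicates with each thread to work out which p-value is larger or smaller across the runs. Thanks in advance for your help.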
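On the question of how foreach "communicates" with its workers: the workers do not share variables with the calling session. Each iteration returns a value, and foreach collects the returned values. A minimal sketch, assuming the foreach and doParallel packages are installed; the data frame d_demo, its coefficients, and the choice of two workers are made up purely for illustration:

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for the weather table (synthetic data, not the real d).
set.seed(1)
n <- 200
d_demo <- data.frame(
  Precip    = rexp(n, 5),
  Temp      = rnorm(n, 75, 8),
  WindSpeed = rpois(n, 5)
)
d_demo$PlayTennis <- rbinom(n, 1, plogis(2 - 0.5 * d_demo$WindSpeed))

cl <- makeCluster(2)      # two workers is an arbitrary choice
registerDoParallel(cl)

candidates <- c("Precip", "Temp", "WindSpeed")

# Each iteration fits one single-variable model and returns one named p-value.
# .combine = c glues the returned values into a vector in the calling session.
pvals <- foreach(v = candidates, .combine = c) %dopar% {
  fit <- glm(reformulate(v, response = "PlayTennis"),
             data = d_demo, family = binomial())
  setNames(coef(summary(fit))[v, "Pr(>|z|)"], v)
}

stopCluster(cl)

# The comparison happens here, after all workers have reported back.
best <- names(which.min(pvals))
```

The key design point is that the minimum is taken once, in the calling session, rather than being tracked inside the loop body.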

d =

Time          Precip    Temp    Relative_Humidity   WindSpeed   …   PlayTennis    
1/1/2000 0:00   0        88           30                0              1    
1/1/2000 1:00   0        80           30                1              1    
1/1/2000 2:00   0        70           44                0              1    
1/1/2000 3:00   0        75           49               10              0    
1/1/2000 4:00   0.78     64           99               15              0    
1/1/2000 5:00   0.01     66           97               15              0    
1/1/2000 6:00   0        74           88                8              0    
1/1/2000 7:00   0        77           82                1              1    
1/1/2000 8:00   0        78           70                1              1    
1/1/2000 9:00   0        79           71                1              1

The code I have is as follows:

newNames <- setdiff(names(d), c("Time", "PlayTennis")) # candidate predictors only: drop the timestamp and the response
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest term in the coefficient table (row 1 is the intercept)

repeat
{
    for (i in seq_along(newNames))
    {
        frm <- as.formula(paste(FRM, newNames[i], sep = ""))
        GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
                    data = d, family = binomial())
        # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
        #                 data = d, family = binomial())

        temp <- coef(summary(GLM))[, 4][counter] # p-value of the variable just added

        if (i == 1) # assign min p value, location, and variable name to the first iteration
        {
            MIN <- temp
            LOC <- i
            VAR <- newNames[i]
        }

        if (temp < MIN) # adjust the min p value accordingly
        {
            MIN <- temp
            LOC <- i
            VAR <- newNames[i]
        }
    }

    if (MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
    {
        break
    }

    FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
    newNames <- newNames[newNames != VAR] # removes variable that is the most significant
    counter <- counter + 1
    if (length(newNames) == 0) break # stop when no candidates are left
}

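newNames? step or add1 might give you a speed gain before you parallelize.

Where does this data come from? Is it built into a package?

If you are trying to do variable selection, please don't. Or rather, don't do it this way. Use a regularization method such as the elastic net; you can use the glmnet package. If you must, consider the glmulti package.

@user20650, I have already tried the step function, but it did not lead to a model containing the significant variables. That is why I wrote my own "manual" but automated way of finding the best model. It is rather time-consuming, though, so I would like to see whether I can spread it across several cores.
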
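As a rough illustration of the add1 suggestion in the comments: add1 refits the current model once per candidate term and reports a test for each, which is one step of the forward search. Note that with test = "Chisq" it reports likelihood-ratio p-values rather than the Wald p-values that summary prints, so the ranking can differ slightly from the hand-written loop. The data frame weather below is synthetic and hypothetical.

```r
# Hypothetical stand-in data; column names mirror the question's table.
set.seed(42)
n <- 300
weather <- data.frame(
  Precip            = rexp(n, 5),
  Temp              = rnorm(n, 75, 8),
  Relative_Humidity = runif(n, 30, 100),
  WindSpeed         = rpois(n, 5)
)
weather$PlayTennis <- rbinom(n, 1, plogis(2 - 0.5 * weather$WindSpeed))

# Start from the intercept-only model and list every term that may be added.
null_fit   <- glm(PlayTennis ~ 1, data = weather, family = binomial())
full_scope <- ~ Precip + Temp + Relative_Humidity + WindSpeed

# One forward step: one row per candidate term, plus a "<none>" row.
step1 <- add1(null_fit, scope = full_scope, test = "Chisq")
cand  <- step1[-1, ]  # drop the "<none>" row

best_var <- rownames(cand)[which.min(cand[["Pr(>Chi)"]])]
best_p   <- min(cand[["Pr(>Chi)"]])
```

To continue the search, the chosen term would be added with update(null_fit, paste(". ~ . +", best_var)) and add1 called again on the updated fit.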
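library(foreach)
library(doParallel)
registerDoParallel(cores = 2) # the number of workers is an assumption; adjust it to your machine

newNames <- setdiff(names(d), c("Time", "PlayTennis")) # candidate predictors only: drop the timestamp and the response
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest term in the coefficient table (row 1 is the intercept)

repeat
{
    # Workers cannot modify MIN, LOC, or VAR in the calling session, so each
    # iteration returns its p-value and foreach combines them into one vector.
    pvals <- foreach (i = seq_along(newNames), .combine = c) %dopar%
    {
        frm <- as.formula(paste(FRM, newNames[i], sep = ""))
        GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
                    data = d, family = binomial())
        # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
        #                 data = d, family = binomial())

        coef(summary(GLM))[, 4][counter] # p-value of the variable just added
    }

    # pick the min p value, its location, and its variable name after all fits are back
    LOC <- which.min(pvals)
    MIN <- pvals[LOC]
    VAR <- newNames[LOC]

    if (MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
    {
        break
    }

    FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
    newNames <- newNames[newNames != VAR] # removes variable that is the most significant
    counter <- counter + 1
    if (length(newNames) == 0) break # stop when no candidates are left
}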
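
The same search can also be sketched with only the base parallel package, wrapped in a function so that the selected variables are returned rather than left in global variables. This is an illustrative sketch, not the original answer's method: the function name forward_select, its arguments, the synthetic weather data, and the two-worker cluster are all assumptions.

```r
library(parallel)

# Forward selection by Wald p-value; candidate models for each step are
# fitted on the workers of cluster `cl`.
forward_select <- function(data, response, candidates, cl, alpha = 0.05) {
  selected <- character(0)
  while (length(candidates) > 0) {
    # One fit per remaining candidate; each call returns the p-value
    # of the newly added term.
    pvals <- parSapply(cl, candidates, function(v, selected, response, data) {
      fit <- glm(reformulate(c(selected, v), response = response),
                 data = data, family = binomial())
      coef(summary(fit))[v, 4]
    }, selected = selected, response = response, data = data)

    if (min(pvals) > alpha) break          # nothing significant is left
    best       <- candidates[which.min(pvals)]
    selected   <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

# Usage on synthetic, hypothetical data.
set.seed(7)
n <- 300
weather <- data.frame(
  Precip    = rexp(n, 5),
  Temp      = rnorm(n, 75, 8),
  WindSpeed = rpois(n, 5)
)
weather$PlayTennis <- rbinom(n, 1, plogis(2 - 0.5 * weather$WindSpeed))

cl <- makeCluster(2)
chosen <- forward_select(weather, "PlayTennis",
                         c("Precip", "Temp", "WindSpeed"), cl)
stopCluster(cl)
```

Because each glm fit on a small table is fast, the cost of shipping the data to the workers at every step can outweigh the gain; parallelizing tends to pay off only when there are many candidates or many rows.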