Linear regression and group by in R


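I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine writing a for loop over the states, fitting the regression inside the loop, and appending the result of each regression to a vector. That does not seem very R-like, however. In SAS I would use a "by" statement, and in SQL I would use "group by". What is the R way to do this?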

## make fake data
 ngroups <- 2
 group <- 1:ngroups
 nobs <- 100
 dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
 head(dta)
#--------------------
  group          y         x
1     1  0.6482007 0.5429575
2     1 -0.4637118 0.7052843
3     1 -0.5129840 0.7312955
4     1 -0.6612649 0.9028034
5     1 -0.5197448 0.1661308
6     1  0.4240346 0.8944253
#------------ 
## function to extract the results of one model
 foo <- function(z) {
   ## coef and se in a data frame
   mr <- data.frame(coef(summary(lm(y~x,data=z))))
   ## put row names (predictors/indep variables)
   mr$predictor <- rownames(mr)
   mr
 }
 ## see that it works
 foo(subset(dta,group==1))
#=========
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
#----------
## one option: use command by
 res <- by(dta,dta$group,foo)
 res
#=========
dta$group: 1
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
------------------------------------------------------------ 
dta$group: 2
               Estimate Std..Error    t.value  Pr...t..   predictor
(Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
x            0.06286456  0.3020321  0.2081387 0.8355526           x

## using package plyr is better
 library(plyr)
 res <- ddply(dta,"group",foo)
 res
#----------
  group    Estimate Std..Error    t.value  Pr...t..   predictor
1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
4     2  0.06286456  0.3020321  0.2081387 0.8355526           x

Here's one way using the lme4 package:
 library(lme4)
 library(lattice)  # xyplot() comes from lattice
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

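In my opinion, a mixed linear model is a better approach for this kind of data. In such a model, the fixed effects give the overall trend. The random effects indicate how the trend for each individual state differs from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed-Effects Models in S and S-PLUS).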
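The code that went with this answer is missing here. As a rough sketch only (not the original author's code), a model of this kind could be fit with nlme, the package written by Pinheiro and Bates. The simulated data, the seed, the random intercept and slope per state, the AR(1) correlation structure, and the choice of the optim optimizer are all assumptions made for illustration.

```r
library(nlme)  # recommended package, normally installed alongside R

set.seed(123)  # arbitrary seed, only for reproducibility
n_states <- 20
n_years  <- 22

## hypothetical data: every state gets its own intercept and slope
d <- data.frame(state = factor(rep(seq_len(n_states), each = n_years)),
                year  = rep(seq_len(n_years), times = n_states))
b0 <- rnorm(n_states, mean = 0, sd = 2)    # state-level intercept shifts
b1 <- rnorm(n_states, mean = 0, sd = 0.3)  # state-level slope shifts
idx <- as.integer(d$state)
d$response <- 1 + 0.5 * d$year + b0[idx] + b1[idx] * d$year + rnorm(nrow(d))

## fixed effects: overall trend; random effects: per-state deviations;
## corAR1: first-order autocorrelation within each state's series
fit <- lme(response ~ year,
           random      = ~ year | state,
           correlation = corAR1(form = ~ year | state),
           data        = d,
           control     = lmeControl(opt = "optim"))

fixef(fit)        # overall intercept and slope
head(ranef(fit))  # how each state departs from the overall trend
```

Whether random slopes and an AR(1) term are appropriate depends on the data, which is the point raised in the comments about validating the assumptions.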
Here's an approach using the package:
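The code for this answer did not survive on this page. As a stand-in rather than a reconstruction, here is a minimal base-R sketch of the same split / fit / combine pattern. The fake data and the seed are assumptions; the column names follow the other answers in this thread.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

## fake data in the same shape as the other answers
d <- data.frame(state    = rep(c("NY", "CA"), each = 10),
                year     = rep(1:10, 2),
                response = rnorm(20))

## split the data by state, then fit one lm per piece
models <- lapply(split(d, d$state),
                 function(df) lm(response ~ year, data = df))

## combine: one column of coefficients per state
coefs <- sapply(models, coef)
coefs

## models is a named list, so a single fit can be pulled out by state
summary(models[["NY"]])
```

The result is a plain named list of lm objects, which is close to the "vector of lm responses" the question asks for, without an explicit for loop.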
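The lm() function above is a simple example. By the way, I imagine your data has columns like the following:
year state var1 var2 y
In my view, you can use the following code: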
# data  = your data frame
# state = the name of the column holding the states
# pass each subset to lm() via data=, otherwise every group is fit on the full data
modell <- by(data, data$state, function(df) lm(y ~ I(1/var1) + I(1/var2), data = df))
lapply(modell, summary)

A nice solution using data.table was posted on CrossValidated by @Zach.
I would just add that r^2 (the coefficient of determination) can also be obtained group by group:
## make fake data
    library(data.table)
    set.seed(1)
    dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))

## calculate r^2 (coefficient of determination) for each group
    dat[,summary(lm(y~x))$r.squared,by=grp]
       grp         V1
    1:   1 0.01465726
    2:   2 0.02256595
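If you also want the slope and its p-value next to r^2, the same idea can be written in base R. This is a minimal sketch, assuming the same kind of fake data held in a plain data.frame; fit_stats is a hypothetical helper name, not part of any package.

```r
set.seed(1)
dat <- data.frame(x = runif(100), y = runif(100), grp = rep(1:2, 50))

## hypothetical helper: one row of fit statistics for one group
fit_stats <- function(df) {
  s <- summary(lm(y ~ x, data = df))
  data.frame(grp       = df$grp[1],
             r.squared = s$r.squared,
             slope     = coef(s)["x", "Estimate"],
             p.value   = coef(s)["x", "Pr(>|t|)"])
}

## split by group, summarize each piece, stack the rows
res <- do.call(rbind, lapply(split(dat, dat$grp), fit_stats))
res
```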

Since 2009, dplyr has been released, which provides a very nice way to do this kind of grouping, closely resembling what SAS does.
library(dplyr)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
#    state   model
#   (fctr)   (chr)
# 1     CA <S3:lm>
# 2     NY <S3:lm>
fitted_models$model
# [[1]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.06354      0.02677  
#
#
# [[2]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.35136      0.09385  

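My answer comes a bit late, but I was looking for similar functionality. It seems the built-in function by() in R can also do the grouping easily:
?by contains the following example, which fits a model per group and extracts the coefficients with sapply: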
require(stats)
## now suppose we want to extract the coefficients by group 
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)

The question seems to be about how to call regression functions with formulas that are modified inside a loop.
Here is how you can do it, using the diamonds dataset (the code is the for loop over strCols near the end of this page).
I think it is worth adding the purrr::map approach to this problem.
library(tidyverse)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))

d %>% 
  group_by(state) %>% 
  nest() %>% 
  mutate(model = map(data, ~lm(response ~ year, data = .)))


See @Paul Hiemstra's answer for further ideas on using the broom package with these results.
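This is a very good general statistical theory answer, and it reminds me of some things I had not considered. The application that led me to ask this question is not suited to this solution, but I am glad you brought it up. Thank you.

It is not a good idea to start with a mixed model. How do you know that any of the assumptions are warranted?

You should check the assumptions through model validation (and knowledge of the data). By the way, you cannot guarantee the assumptions for the individual lm fits either. You would have to validate all the models separately.

Suppose you add an additional independent variable that is not available for all states (e.g. miles of ocean coastline) and is represented by NA in your data. Wouldn't the lm call fail? How could that be handled?

In the function, you would need to test for that case and use a different formula.

Is it possible to add the name of the subgroup to each call in the summary (last step)?

If you run layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
and then l_ply(models, plot)
you also get each residual plot. Is it possible to label each plot with its group (e.g. "state" in this case)?

Is there a way to list R2 for both models, e.g. by adding an R2 column after year? And to add p-values for each coefficient?

@ToToRo here you can find a workable solution (better late than never): Your.df[, summary(lm(Y~X))$r.squared, by=Your.factor], where Y, X and Your.factor are your variables. Keep in mind that Your.df must be of class data.table.

I had to do rowwise(fitted_models) %>% tidy(model) to get the broom package to work, but otherwise, great answer.

Works great... you can do all of this without leaving the pipe: d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) %>% rowwise() %>% tidy(model)

@pedram and @holastello, this no longer works, at least with R 3.6.1, broom_0.7.0, dplyr_0.8.3. Running d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) %>% rowwise() %>% tidy(model) gives: Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm): Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.

Now (dplyr 1.0.4, tidyverse 1.3.0), you can do: library(broom); library(tidyverse); d %>% nest(data = -state) %>% mutate(model = map(data, ~lm(response ~ year, data = .)), tidied = map(model, tidy)) %>% unnest(tidied)

Just wanted to tell people that although there are many group-by functions in R, not all of them are suitable for group-by regression. For example...

If you want a column of fitted values or residuals, you can extend this a little: wrap the lm() call in a resid() call, then pipe everything from the last line into an unnest() call. Of course, you may want to change the variable name from "model" to something more relevant.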
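On the comment about a predictor that is NA for an entire state: lm() stops with an error when no complete cases are left, so the per-group function has to check for that, as the reply suggests. A minimal sketch, assuming a hypothetical coast column and a hypothetical fit_state helper; the state "KS" and the fake data are made up for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

d <- data.frame(state    = rep(c("NY", "CA", "KS"), each = 10),
                year     = rep(1:10, 3),
                response = rnorm(30),
                coast    = runif(30))
d$coast[d$state == "KS"] <- NA  # hypothetical: no coastline data for this state

## hypothetical helper: drop the predictor when it is missing for the whole group
fit_state <- function(df) {
  if (all(is.na(df$coast))) {
    lm(response ~ year, data = df)
  } else {
    ## rows with a partly missing coast value are dropped by lm()'s default na.action
    lm(response ~ year + coast, data = df)
  }
}

models <- lapply(split(d, d$state), fit_state)

## number of coefficients per state shows which formula was used
sapply(models, function(m) length(coef(m)))
```

Note that the resulting models are no longer directly comparable across states, since some include the extra predictor and some do not.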
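## extract coefficients, fit statistics and fitted values with broom
## (fitted_models is the object built in the dplyr answer above)
## note: per the comments, this form of tidy() stopped working in later broom versions
library(broom)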
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
# 
#    state        term    estimate  std.error  statistic   p.value
#   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
# 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
# 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
# 
#    state   r.squared adj.r.squared     sigma statistic   p.value    df
#   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
# 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
# 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
#   df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
# 
#     state   response  year      .fitted   .se.fit     .resid      .hat
#    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
# 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
# 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
# 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
# 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
# 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
# 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
# 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
# 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
# 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
# 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
# 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
# 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
# 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
# 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
# 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
# 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
# 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
# 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)

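## loop over formulas built from column names (diamonds dataset)
diamonds <- ggplot2::diamonds   # use data= instead of attach()
strCols <- names(diamonds)

formula <- list(); model <- list()
for (i in 1:1) {
  ## column 7 is price; column 7+i is the predictor
  formula[[i]] <- as.formula(paste0(strCols[7], " ~ ", strCols[7+i]))
  model[[i]] <- glm(formula[[i]], data = diamonds)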

  #then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))      
  plot(model[[i]])
  dev.off()
  }
