lm in R: getting around the 'contrasts' error

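I am building a linear model on a large amount of data (50 million rows) with the biglm package. I first fit a linear model on one chunk of data, then read in further chunks (one million rows each) and update the model with the "update" function from "biglm". My model uses year (a factor with 20 levels), temperature, and a 1-or-0 factor variable, is_paid. The code looks something like this: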

model = biglm(output~year:is_paid+temp, data = df) # create the initial model from a starting data frame, df
newdata = file[i] # placeholder for reading in the next chunk of data; don't worry about it
model = update(model, newdata) # update the model with the new chunk

# Workaround: the variable 'line' is a single row of data that has a '1' for is_paid
newdata = file[i] # again, a placeholder for reading in a new chunk; it doesn't make sense by itself
newdata = rbind(line, newdata) # add the sample row with '1' in is_paid to the chunk
model = update(model, newdata) # update the model with the padded chunk

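The problem is that the is_paid factor variable is almost always 0. So sometimes, when I read in a chunk of data, every value in the is_paid column is 0, and I get the following error: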

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels

Here is a sample of my data:

output   year  temp  is_paid
1100518  12    40    0
2104518  12    29    0
1100200  15    17    0
1245110  16    18    0
5103128  14    30    0

Here is an example of my sample row, a real record where is_paid is 1:

output    year  temp  is_paid
31200599  12    49    1

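Will adding the same row over and over again skew the coefficients I get? I tested this on some dummy code, and repeatedly updating the model with the same record did not appear to skew them, but I have my doubts.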

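I feel like there is a more elegant and clever way to do this. I have been reading through R tutorials, and it seems there is a way to set the contrasts for an lm model. I looked at the "contrasts" argument in "lm" but couldn't figure anything out. I don't think you can set contrasts in biglm, which is what I need to use. I would really appreciate any insight or solutions you can think of.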

*Comparison of is_paid as a numeric variable versus a factor variable:

df.num = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = c(rep(0,each = 5),rep(1,each = 5)))
df.factor = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = as.factor(c(rep(0,each = 5),rep(1,each = 5))))

mod.factor = lm(a~b:c,data = df.factor)
mod.num = lm(a~b:c,data = df.num)

> mod.factor

Call:
lm(formula = a ~ b:c, data = df.factor)
Coefficients:
(Intercept)        b1:c0        b2:c0        b3:c0        b4:c0        b5:c0        b1:c1  
    9.5         -8.0         -6.0         -4.5           NA           NA           NA  
  b2:c1        b3:c1        b4:c1        b5:c1  
     NA         -3.5         -2.0           NA  


Call:
lm(formula = a ~ b:c, data = df.num)

Coefficients:
(Intercept)         b1:c         b2:c         b3:c         b4:c         b5:c  
    3.0           NA           NA          3.0          4.5          6.5

The conclusion here is that the model changes if is_paid is numeric.
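To see why some coefficients come out as NA in both fits, it can help to cross-tabulate b and c in the toy data. The following is a minimal sketch using only base R; the names cell.counts and n.empty are mine, not part of the original post.

```r
# Toy data from the comparison above (c kept numeric here).
df.num <- data.frame(a = 1:10,
                     b = factor(rep(1:5, each = 2)),
                     c = c(rep(0, 5), rep(1, 5)))

# Count the observations in each (b, c) cell. A zero marks a combination
# that never occurs, so no separate effect can be estimated for it.
cell.counts <- table(b = df.num$b, c = df.num$c)
print(cell.counts)

# Number of empty cells out of the 5 x 2 = 10 possible combinations.
n.empty <- sum(cell.counts == 0)
print(n.empty)
```

Only b = 3 is observed with both values of c, so most of the b:c combinations have nothing to be estimated from.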

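****I have also edited my model slightly so that it looks at the interaction of two factors, rather than just three separate variables. This means I can't treat is_paid as numeric (I think).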

Turning Ben Bolker's comment into an answer, with a demonstration on some better simulated data that it works.

Just treat your two-level factor as a continuous (numeric) variable. The fit is the same as when it is treated as a factor.

For example:

df.num = data.frame(a = rnorm(12),
                    b = as.factor(rep(1:4,each = 3)),
                    c = rep(0:1, 6))
df.factor = df.num
df.factor$c = factor(df.factor$c)

mod.factor = lm(a~b*c - 1,data = df.factor)
mod.num = lm(a~b*c - 1,data = df.num)

all(coef(mod.factor) == coef(mod.num))
# [1] TRUE
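Applied to the chunking problem in the question, here is a rough sketch of what happens with a chunk in which is_paid is all 0. It uses base R's model.matrix only, not biglm, so how biglm's update treats the resulting all-zero columns is not shown. The data values are borrowed from the sample rows in the question; the year levels 12:16 are an assumption for illustration (the real data has 20 levels), and the names bad, ok, X.num and X.fac are made up.

```r
# Hypothetical chunk in which every is_paid value is 0.
chunk <- data.frame(output  = c(1100518, 2104518, 1100200, 1245110, 5103128),
                    year    = factor(c(12, 12, 15, 16, 14), levels = 12:16),
                    temp    = c(40, 29, 17, 18, 30),
                    is_paid = c(0, 0, 0, 0, 0))

form <- output ~ year:is_paid + temp

# 1. is_paid turned into a factor from this chunk alone: it has a single
#    level, so building the design matrix raises the contrasts error.
bad <- transform(chunk, is_paid = factor(is_paid))
msg <- tryCatch({ model.matrix(form, data = bad); "no error" },
                error = function(e) conditionMessage(e))
print(msg)

# 2. is_paid left as a numeric 0/1 column: no contrasts are needed,
#    so the design matrix is built without complaint.
X.num <- model.matrix(form, data = chunk)

# 3. is_paid as a factor with both levels declared up front: the factor
#    has two levels even though only 0 occurs in this chunk.
ok <- transform(chunk, is_paid = factor(is_paid, levels = c(0, 1)))
X.fac <- model.matrix(form, data = ok)

print(colnames(X.num))
print(colnames(X.fac))
```

Either option 2 or option 3 avoids padding each chunk with an artificial row. As far as I know, biglm expects factor levels to be the same in every chunk anyway, so declaring the levels explicitly (for year as well) seems the safer habit.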

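Why can't you convert the two-level factor variable to numeric (e.g. as.numeric(f)-1)? The fitted model will be the same.

I will edit in a small example I wrote that shows you are correct. The fact that you are correct in this case confuses me a little. I thought you were supposed to use factors for indicator variables like this. Is this only true because I am only using 1s and 0s?

Sorry, I didn't write the correct model formula the first time. Your solution works when the formula is output~year+temp+is_paid, but not when looking at the interaction between two factor variables (as in my model).

In your mod.factor you have 10 data points, one factor with 5 levels and one factor with 2 levels. 2*5=10, so it is singular. That is why you get the NAs. But Ben Bolker is completely right. Also, because of the way both factors are laid out in your simulated data, you have no observations with c=1 for b=1 or b=2. Everything is as it should be.

Great, thank you. Could you expand on your formula a bit, specifically why you add "-1" after b*c?

The -1 (equivalent to +0) in an R formula means "do not fit an intercept". In a regression with only categorical variables, omitting the intercept makes the coefficients comparisons against 0 rather than against the reference level. You can remove the -1 from both model formulas and the results will still be identical between the models (as will the quality of fit); only the labels and interpretation of the parameters change.

I see. Thanks so much for your post. If I added a non-categorical variable, could I still use "-1"?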
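On that last question, here is a small sketch with one factor and one continuous covariate. The data are simulated and every name here (d, f, x, y, with.int, no.int) is made up for illustration. As long as a factor is in the model, its indicator columns already span the intercept, so dropping the intercept with -1 only relabels the factor coefficients; the fitted values and the slope of the continuous variable should not change.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated data: a three-level factor f and a continuous covariate x.
d <- data.frame(f = factor(rep(c("a", "b", "c"), each = 4)),
                x = rnorm(12))
d$y <- 2 * d$x + as.integer(d$f) + rnorm(12, sd = 0.1)

with.int <- lm(y ~ f + x, data = d)      # intercept plus contrasts for f
no.int   <- lm(y ~ f + x - 1, data = d)  # one coefficient per level of f

# Same fitted values and same slope for x in both parameterisations.
same.fit   <- isTRUE(all.equal(fitted(with.int), fitted(no.int)))
same.slope <- isTRUE(all.equal(coef(with.int)[["x"]], coef(no.int)[["x"]]))
print(c(same.fit = same.fit, same.slope = same.slope))

print(coef(with.int))
print(coef(no.int))
```

Two caveats. If the model has only continuous predictors, -1 forces the fit through the origin and does change the model. And summary() computes R-squared differently for a model without an intercept, so that number is not comparable between the two fits even though the residuals are the same.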