Multiple linear regression in Python
I can't seem to find any Python libraries that do multiple regression. The only things I find are simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:
print('y x1 x2 x3 x4 x5 x6 x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))
(Output of the above:)
How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
Then clf.coef_ will have the regression coefficients. sklearn.linear_model also provides similar interfaces to fit the regression with various kinds of regularization.
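For example, a ridge (L2-regularized) fit is a drop-in replacement; a minimal sketch, reusing the texts data from above:

from sklearn.linear_model import Ridge

# same fit/coef_ interface, plus a regularization strength alpha
clf = Ridge(alpha=1.0)
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
print(clf.coef_)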
Here is a little workaround that I created. I checked it with R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    # start with the first predictor plus a column of ones for the intercept
    X = sm.add_constant(np.column_stack((x[0], ones)))
    # prepend each remaining predictor as a new column
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Run it and print the summary (note that, because of the way the columns are stacked, x1 in the summary corresponds to the last list in x and x3 to the first):

print(reg_m(y, x).summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895
==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6
==============================================================================
pandas provides a convenient way to run OLS, as shown in this answer.

You can use numpy.linalg.lstsq to compute the coefficient vector beta_hat:
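A minimal sketch, assuming the same y and X data used in the GLM answer near the end of this page (the intercept column is appended last):

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])

X = X.T                             # one observation per row
X = np.c_[X, np.ones(X.shape[0])]   # column of ones for the intercept
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)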
Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762
  22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,
  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
Use scipy.optimize.curve_fit. And it works for more than just linear fits:
from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0, x1) data:
#   x0 = 0  1  2
#   ____________
# x1=0 | 0  1  2
# x1=1 | 1  2  3
# x1=2 | 2  3  4

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
              [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])

popt, pcov = curve_fit(fn, x, y)
print(popt)
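Since curve_fit is not restricted to models that are linear in the inputs, a hypothetical nonlinear variant of fn fits the same way:

import numpy as np
from scipy.optimize import curve_fit

# hypothetical model with an interaction term and an exponential term
def fn2(x, a, b, c):
    return a + b * x[0] * x[1] + c * np.exp(-x[1])

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
              [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])

popt, pcov = curve_fit(fn2, x, y)
print(popt)  # best-fit a, b, c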
You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
        if None, the last column of x is used as y
    @param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    # append y (if given) and add an intercept column
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # the last column is the response, everything else is a predictor
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
Call it after converting your data to a DataFrame (df). The intercept term is included by default.
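A hypothetical usage sketch, with the response as the last column of the frame:

import numpy as np
import pandas as pd

# toy data: y is a linear function of x1 and x2 plus a little noise
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.random(50), 'x2': rng.random(50)})
df['y'] = 2 * df['x1'] - df['x2'] + 0.1 * rng.random(50)

res = linear(df)   # the last column is used as y
print(res.params)  # includes the 'const' intercept term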
For more examples, see:

To clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference: the very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multiple linear regression and multivariate linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.

In short:
- multiple linear regression: the response y is a scalar
- multivariate linear regression: the response y is a vector
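Incidentally, sklearn's LinearRegression handles the multivariate case as well, since fit accepts a two-dimensional response; a minimal sketch:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))   # 3 predictors
Y = rng.random((20, 2))   # vector-valued response with 2 components

model = LinearRegression().fit(X, Y)
print(model.coef_.shape)  # (2, 3): one row of coefficients per output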
I think this may be the easiest way to get the job done:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS

lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
x['b'] = 1  # constant column for the intercept
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())
         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1
print(y.head())
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print(result.summary())

Because y was constructed as an exact linear function of the predictors, with no noise term, the fit recovers the coefficients exactly (hence the R-squared of 1.000):
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
Multiple linear regression can be handled using the sklearn library referenced above. I'm using the Anaconda install of Python 3.6.

Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
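The constant term c is not part of coef_; assuming the regressor and X from above, it lives in a separate attribute:

print(regressor.intercept_)      # the constant term c
print(regressor.predict(X[:5]))  # predictions for the first five rows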
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm

# y_data is the name of the dependent variable in your data
y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
model_fit = sm.OLS(y, x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit or sm.Probit, and so on.
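A minimal sketch of swapping in sm.Logit, assuming my_data has a hypothetical binary 0/1 column y_bin:

from patsy import dmatrices
import statsmodels.api as sm

# same formula interface, different model class
y, x = dmatrices("y_bin ~ x_1 + x_2", data=my_data)
results = sm.Logit(y, x).fit()
print(results.summary())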
Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for multiple linear regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt  # to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv("<Your-dataset-path>")

# Assigning feature and target variables
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Encode categorical variables, if you have any
# (newer scikit-learn removed OneHotEncoder's categorical_features argument;
# ColumnTransformer selects which columns to encode instead)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(), ['<column-name>'])],
                       remainder='passthrough')
X = ct.fit_transform(X)

# Avoiding the dummy variable trap
X = X[:, 1:]  # usually done by the algorithm itself

# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0,
                                                    test_size=0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)
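A possible follow-up to gauge the fit on the held-out 20%:

from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))  # R^2 on the test set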
That's it. You can use this code as a template for implementing multiple linear regression on any dataset.
For a better understanding of the example, visit:

Linear models like this can be fitted with OpenTURNS. There this is done by the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:

Y = a0 + a1*X1 + ... + an*Xn + ε

where the error ε is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai:
import pandas as pd
import openturns as ot

# Assuming the data is a csv file with the given structure:
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep=r"\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
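The fitted model can also be evaluated as a function; a small sketch using getMetaModel on the result object from above:

# Evaluate the fitted linear model at the first input point
metamodel = result.getMetaModel()
print(metamodel(X[0]))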
You can find more detailed examples among the OpenTURNS examples.

Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # transpose: one tuple of 7 features per observation
df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df = df.drop('X', axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula, data=df, family=model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
# print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2 = pd.DataFrame()
df2['X0'] = np.linspace(0.50, 0.70, 50)

df3 = pd.DataFrame()
df3['X1'] = np.linspace(0.20, 0.60, 50)

prediction0 = model_fit.predict(df2)
# prediction1 = model_fit.predict(df3)

plt.plot(df2['X0'], prediction0, label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
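To use all seven predictors (the commented-out model_formula above), the same call works unchanged:

# Fit with all seven predictors instead of just X0
full_fit = glm(formula='y ~ X0 + X1 + X2 + X3 + X4 + X5 + X6',
               data=df, family=sm.families.Gaussian()).fit()
print(full_fit.params)  # intercept plus seven coefficients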
Comments on the question:

- Not an expert, but if the variables are independent, can't you just run simple regression against each one and sum the results?
- @HughBothwell You can't assume the variables are independent though. In fact, if you assume the variables are independent, you may potentially be modeling your data incorrectly. In other words, the responses Y may be correlated with each other, but assuming independence does not accurately model the dataset.
- @HughBothwell Sorry if this is a dumb question, but why does it matter whether the raw feature variables x_i are independent or not? How would that affect the predictor (= model)?
- This returns an error. Are there other solutions? @Dougal can sklea