Multiple linear regression in Python
I can't seem to find any Python libraries that do multiple regression. The only things I find are simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:
print('y x1 x2 x3 x4 x5 x6 x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))
(Output of the above:)
How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
Then clf.coef_ will have the regression coefficients. sklearn.linear_model also provides similar interfaces to fit the regression with various kinds of regularization.
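For example, a ridge (L2-regularized) fit is a drop-in replacement; a minimal sketch, reusing the texts data from above:

from sklearn.linear_model import Ridge

# same fit/coef_ interface, plus a regularization strength alpha
clf = Ridge(alpha=1.0)
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
print(clf.coef_)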
Here is a little workaround that I created. I checked it with R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    # start with the first predictor plus a column of ones for the intercept
    X = sm.add_constant(np.column_stack((x[0], ones)))
    # prepend each remaining predictor as a new column
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Run it and print the summary (note that, because of the way the columns are stacked, x1 in the summary corresponds to the last list in x and x3 to the first):

print(reg_m(y, x).summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895
==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6
==============================================================================
pandas provides a convenient way to run OLS, as shown in this answer.

You can use numpy.linalg.lstsq to compute the coefficient vector beta_hat:
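A minimal sketch, assuming the same y and X data used in the GLM answer near the end of this page (the intercept column is appended last):

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])

X = X.T                             # one observation per row
X = np.c_[X, np.ones(X.shape[0])]   # column of ones for the intercept
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)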
Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762
  22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,
  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
Use scipy.optimize.curve_fit. And it works for more than just linear fits:
from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0, x1) data:
#   x0 = 0  1  2
#   ____________
# x1=0 | 0  1  2
# x1=1 | 1  2  3
# x1=2 | 2  3  4

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
              [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])

popt, pcov = curve_fit(fn, x, y)
print(popt)
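Since curve_fit is not restricted to models that are linear in the inputs, a hypothetical nonlinear variant of fn fits the same way:

import numpy as np
from scipy.optimize import curve_fit

# hypothetical model with an interaction term and an exponential term
def fn2(x, a, b, c):
    return a + b * x[0] * x[1] + c * np.exp(-x[1])

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
              [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])

popt, pcov = curve_fit(fn2, x, y)
print(popt)  # best-fit a, b, c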
You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
        if None, the last column of x is used as y
    @param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    # append y (if given) and add an intercept column
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # the last column is the response, everything else is a predictor
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
Call it after converting your data to a DataFrame (df). The intercept term is included by default.
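A hypothetical usage sketch, with the response as the last column of the frame:

import numpy as np
import pandas as pd

# toy data: y is a linear function of x1 and x2 plus a little noise
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.random(50), 'x2': rng.random(50)})
df['y'] = 2 * df['x1'] - df['x2'] + 0.1 * rng.random(50)

res = linear(df)   # the last column is used as y
print(res.params)  # includes the 'const' intercept term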
For more examples, see:

To clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference: the very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multiple linear regression and multivariate linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.

In short:
- multiple linear regression: the response y is a scalar
- multivariate linear regression: the response y is a vector
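Incidentally, sklearn's LinearRegression handles the multivariate case as well, since fit accepts a two-dimensional response; a minimal sketch:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))   # 3 predictors
Y = rng.random((20, 2))   # vector-valued response with 2 components

model = LinearRegression().fit(X, Y)
print(model.coef_.shape)  # (2, 3): one row of coefficients per output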
I think this may be the easiest way to get the job done:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS

lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
x['b'] = 1  # constant column for the intercept
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())
         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1
print(y.head())
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print(result.summary())

Because y was constructed as an exact linear function of the predictors, with no noise term, the fit recovers the coefficients exactly (hence the R-squared of 1.000):
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
Multiple linear regression can be handled using the sklearn library referenced above. I'm using the Anaconda install of Python 3.6.

Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
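The constant term c is not part of coef_; assuming the regressor and X from above, it lives in a separate attribute:

print(regressor.intercept_)      # the constant term c
print(regressor.predict(X[:5]))  # predictions for the first five rows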
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm

# y_data is the name of the dependent variable in your data
y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
model_fit = sm.OLS(y, x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit or sm.Probit, and so on.
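A minimal sketch of swapping in sm.Logit, assuming my_data has a hypothetical binary 0/1 column y_bin:

from patsy import dmatrices
import statsmodels.api as sm

# same formula interface, different model class
y, x = dmatrices("y_bin ~ x_1 + x_2", data=my_data)
results = sm.Logit(y, x).fit()
print(results.summary())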
Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for multiple linear regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt  # to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv("<Your-dataset-path>")

# Assigning feature and target variables
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Encode categorical variables, if you have any
# (newer scikit-learn removed OneHotEncoder's categorical_features argument;
# ColumnTransformer selects which columns to encode instead)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(), ['<column-name>'])],
                       remainder='passthrough')
X = ct.fit_transform(X)

# Avoiding the dummy variable trap
X = X[:, 1:]  # usually done by the algorithm itself

# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0,
                                                    test_size=0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)
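A possible follow-up to gauge the fit on the held-out 20%:

from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))  # R^2 on the test set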
That's it. You can use this code as a template for implementing multiple linear regression on any dataset.
For a better understanding of the example, visit:

Linear models like this can be fitted with OpenTURNS. There this is done by the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:

Y = a0 + a1*X1 + ... + an*Xn + ε

where the error ε is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai:
import pandas as pd
import openturns as ot

# Assuming the data is a csv file with the given structure:
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep=r"\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
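The fitted model can also be evaluated as a function; a small sketch using getMetaModel on the result object from above:

# Evaluate the fitted linear model at the first input point
metamodel = result.getMetaModel()
print(metamodel(X[0]))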
You can find more detailed examples among the OpenTURNS examples.

Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # transpose: one tuple of 7 features per observation
df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df = df.drop('X', axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula, data=df, family=model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
# print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2 = pd.DataFrame()
df2['X0'] = np.linspace(0.50, 0.70, 50)

df3 = pd.DataFrame()
df3['X1'] = np.linspace(0.20, 0.60, 50)

prediction0 = model_fit.predict(df2)
# prediction1 = model_fit.predict(df3)

plt.plot(df2['X0'], prediction0, label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
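To use all seven predictors (the commented-out model_formula above), the same call works unchanged:

# Fit with all seven predictors instead of just X0
full_fit = glm(formula='y ~ X0 + X1 + X2 + X3 + X4 + X5 + X6',
               data=df, family=sm.families.Gaussian()).fit()
print(full_fit.params)  # intercept plus seven coefficients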
Comments on the question:

- Not an expert, but if the variables are independent, can't you just run simple regression against each one and sum the results?
- @HughBothwell You can't assume the variables are independent though. In fact, if you assume the variables are independent, you may potentially be modeling your data incorrectly. In other words, the responses Y may be correlated with each other, but assuming independence does not accurately model the dataset.
- @HughBothwell Sorry if this is a dumb question, but why does it matter whether the raw feature variables x_i are independent or not? How would that affect the predictor (= model)?
- This returns an error. Are there other solutions? @Dougal can sklea