How do I implement multiple linear regression in Python?


python pandas machine-learning statistics


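I am trying to write a multiple linear regression model from scratch to identify the key factors that affect the number of views of songs on Facebook. For each song we collect the following information, which are the variables I use: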

df.dtypes
clicked                      int64
listened_5s                  int64
listened_20s                 int64
views                        int64
percentage_listened          float64
reactions_total              int64
shared_songs                 int64
comments                     int64
avg_time_listened            int64
song_length                  int64
likes                        int64
listened_later               int64

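I use the number of views as the dependent variable and all other variables in the dataset as independent variables. The model is set up as follows: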

    from sklearn import linear_model
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # df_x --> new dataframe of independent variables
    df_x = df.drop(columns=['views'])

    # df_y --> new dataframe of the dependent variable, views
    # (.ix has been removed from pandas; select the column by label instead)
    df_y = df[['views']]

    names = list(df_x)

    regr = linear_model.LinearRegression()
    x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)

    # Fit the model to the training dataset
    regr.fit(x_train, y_train)
    regr.intercept_
    print('Coefficients: \n', regr.coef_)
    print('Mean Squared Error (MSE): %.2f'
          % mean_squared_error(y_test, regr.predict(x_test)))
    print('Variance Score: %.2f' % regr.score(x_test, y_test))
    regr.coef_[0].tolist()

Output:

 regr.intercept_
 array([-1173904.20950487])
 MSE: 19722838329246.82
 Variance Score: 0.99

Something seems to be seriously wrong.
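One way to put the reported MSE in context: it is measured in squared views, so its square root (the RMSE) is in the same units as `views` and is easier to compare with typical view counts in the data. The sketch below only reuses the MSE value printed above; it makes no assumption about the dataset itself.

```python
import math

# MSE copied from the output above (units: views squared)
mse = 19722838329246.82

# The root mean squared error is in the same units as the target (views)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:,.0f} views")
```

This gives an RMSE of roughly 4.4 million views. Whether that is large depends on the scale of `views` in the dataset, which is not shown here.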

Trying an OLS model:

    import statsmodels.api as sm
    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    # Note: sm.OLS does not add an intercept unless a constant column is included
    model = sm.OLS(y_train, x_train)
    result = model.fit()
    print(result.summary())

Output:

     R-squared:                       0.992
     F-statistic:                     6121.   

                           coef    std err          t      P>|t|     [95.0% Conf. Int.]

clicked                  0.3333      0.012     28.257      0.000       0.310      0.356
listened_5s             -0.4516      0.115     -3.944      0.000      -0.677     -0.227
listened_20s             1.9015      0.138     13.819      0.000       1.631      2.172
percentage_listened   7693.2520   1.44e+04      0.534      0.594   -2.06e+04    3.6e+04
reactions_total          8.6680      3.561      2.434      0.015       1.672     15.664
shared_songs           -36.6376      3.688     -9.934      0.000     -43.884    -29.392
comments                34.9031      5.921      5.895      0.000      23.270     46.536
avg_time_listened     1.702e+05   4.22e+04      4.032      0.000    8.72e+04   2.53e+05
song_length          -6309.8021   5425.543     -1.163      0.245    -1.7e+04   4349.413
likes                    4.8448      4.194      1.155      0.249      -3.395     13.085
listened_later          -2.3761      0.160    -14.831      0.000      -2.691     -2.061


Omnibus:                      233.399   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2859.005
Skew:                           1.621   Prob(JB):                         0.00
Kurtosis:                      14.020   Cond. No.                     2.73e+07

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.
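As a rough illustration of warning [2]: a large condition number can come from features measured on very different scales, from features that are nearly copies of each other, or from both. The sketch below uses synthetic data only (the variable names `small`, `large`, and `noisy_copy` and the helper `standardize` are made up for this example, not taken from the dataset) and shows that standardizing removes the scale effect but not genuine collinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Case 1: two independent features on very different scales (synthetic)
small = rng.normal(size=n)                # values around 1
large = rng.normal(scale=1e6, size=n)     # values around a million
X_scale = np.column_stack([small, large])

cond_raw = np.linalg.cond(X_scale)               # large, driven by scale alone
cond_std = np.linalg.cond(standardize(X_scale))  # close to 1 after scaling

# Case 2: two features that are almost identical (synthetic)
noisy_copy = small + rng.normal(scale=0.01, size=n)
X_collinear = np.column_stack([small, noisy_copy])

cond_collinear = np.linalg.cond(standardize(X_collinear))  # stays large

print(f"different scales, raw:          {cond_raw:.3g}")
print(f"different scales, standardized: {cond_std:.3g}")
print(f"near-duplicate, standardized:   {cond_collinear:.3g}")
```

If the condition number stays high after standardizing the real data, that would point toward correlated predictors rather than scaling alone.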

Just from looking at this output, something seems to be seriously wrong.

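I believe something went wrong with the train/test split or with creating the two separate dataframes x and y, but I cannot figure out what. This problem has to be solved with multiple regression. Is the relationship not linear? Can you help me find out what went wrong?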

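Comments:

Most of the columns you are using seem to be "after-effects" of being viewed.

@VivekKumar So what is the suggestion? Not multiple linear regression?

I don't quite understand what you mean by "multiple linear regression". In the comment above, I meant that most of your data may be correlated, as you found in the statsmodels output (possibly because it all depends on "views"). If you have other features, such as the audio content, artist, or genre, I would suggest using those.

Unfortunately I don't have that kind of information. The dataset I was given is real, and I have to find the key factors that influence views (which represent the number of times a song's video appeared in users' news feeds). You are right that the other variables can be regarded as after-effects. However, in a social network, sharing or liking content also causes my friends to see the song's video in their news feeds: more likes lead to more views, and more comments lead to more views. So the question now is, given this set of data, how do I find the key contributors? What would you use?

Then first try standardizing the data, and then use linear regression or a decision tree regressor. In linear regression, the coefficients (coef_) will give you the feature importances. In DecisionTreeRegressor, use the feature_importances_ attribute. Also, posting this on the site may get you more help.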
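The suggestion in the last comment can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the asker's dataset: the column names are borrowed from the question, but the values, the target formula, and the `max_depth` setting are assumptions made up for the example. Coefficient magnitudes are only comparable across features once the features are standardized, and with strongly correlated features both rankings can be unstable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 1000

# Synthetic features (hypothetical values, chosen only for illustration)
X = pd.DataFrame({
    "clicked": rng.poisson(5000, n),
    "comments": rng.poisson(50, n),
    "song_length": rng.integers(120, 400, n),
})
# Synthetic target: by construction, song_length has no effect on it
y = 3.0 * X["clicked"] + 40.0 * X["comments"] + rng.normal(0, 100, n)

# Standardize so every feature has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Linear regression: rank features by absolute standardized coefficient
lin = LinearRegression().fit(X_scaled, y)
lin_importance = pd.Series(np.abs(lin.coef_), index=X.columns)
print(lin_importance.sort_values(ascending=False))

# Decision tree: rank features by impurity-based importance
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_scaled, y)
tree_importance = pd.Series(tree.feature_importances_, index=X.columns)
print(tree_importance.sort_values(ascending=False))
```

On this synthetic data both rankings should place `song_length` last, since it does not enter the target. On real data where the predictors depend on one another, the two methods may disagree.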