Python: using logistic regression to see which variable adds the most weight to a positive prediction

Tags: python, machine-learning, scikit-learn, logistic-regression, statsmodels

So I have a bank data set in which I have to predict whether a customer will accept a term deposit. I have a column called job; it is categorical and holds each customer's job type. I am currently in the EDA process and want to find out which job category contributes most to a positive prediction.

I plan to use logistic regression (not sure whether this is the right choice; suggestions for alternative approaches are welcome).

This is what I did:

I one-hot encoded the job column (a 1/0 dummy column for each job type, keeping k-1 columns), and the target values are also 1/0 (1 = the customer accepted the term deposit, 0 = the customer did not).
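The encoding step above can be sketched with pandas; a minimal example on made-up rows (the real data set has 45211 rows and more job types), assuming the raw columns are named `job` and `y`:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the bank data set
df = pd.DataFrame({
    "job": ["management", "technician", "student", "management"],
    "y":   ["no", "no", "yes", "no"],
})

# One-hot encode the job column; drop_first=True keeps k-1 dummies,
# so the encoding does not fall into the dummy variable trap
vari = pd.get_dummies(df["job"], prefix="job", drop_first=True)

# Map the yes/no target to 1/0
tgt = (df["y"] == "yes").astype(int)
print(vari.columns.tolist())
print(tgt.tolist())
```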

The target column looks like this:

0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: Target_yes, Length: 45211, dtype: int32
I fitted this with a sklearn logistic regression model and obtained the coefficients. Since I could not interpret them, I looked for an alternative and came across the statsmodels version, and did the same with its Logit function. In the example I saw online, the author used sm.add_constant on the x variables.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(solver='liblinear')
model.fit(vari, tgt)       # vari: dummy-encoded job columns, tgt: 1/0 target
model.score(vari, tgt)
df = pd.DataFrame(model.coef_)
df['inter'] = model.intercept_
print(df)
The model score and the print(df) output are as follows:

0.8830151954170445 (model score)

print(df)
          0         1         2         3         4         5         6  \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778   

          7         8         9        10        11     inter  
0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323 
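The flat numeric frame above is hard to map back to the job types. A small sketch, on made-up stand-in data (since the original vari/tgt frames are not shown in full), of pairing the sklearn coefficients with their column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for the dummy-encoded job columns and 1/0 target
rng = np.random.default_rng(0)
vari = pd.DataFrame(rng.integers(0, 2, size=(200, 3)),
                    columns=["job_retired", "job_student", "job_blue-collar"])
tgt = pd.Series(rng.integers(0, 2, size=200), name="Target_yes")

model = LogisticRegression(solver="liblinear").fit(vari, tgt)

# model.coef_ is in the column order of the input DataFrame, so a
# Series indexed by the column names makes the weights readable
coefs = pd.Series(model.coef_[0], index=vari.columns).sort_values(ascending=False)
print(coefs)
```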
When I use sm.add_constant, I get coefficients similar to the sklearn LogisticRegression ones, but the z-scores (which I planned to use to find the job type that contributes most to a positive prediction) become nan.

import statsmodels.api as sm
logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
logit.summary2()
The results are as follows:

E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning:

Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.

E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning:

invalid value encountered in sqrt

E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:

invalid value encountered in greater

E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:

invalid value encountered in less

E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning:

invalid value encountered in less_equal

Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 13

Model:  Logit   Pseudo R-squared:   0.023
Dependent Variable:     Target_yes  AIC:    31907.6785
Date:   2019-11-18 10:17    BIC:    32012.3076
No. Observations:   45211   Log-Likelihood:     -15942.
Df Model:   11  LL-Null:    -16315.
Df Residuals:   45199   LLR p-value:    3.9218e-153
Converged:  1.0000  Scale:  1.0000
No. Iterations:     13.0000         
                  Coef.     Std.Err.    z   P>|z|   [0.025  0.975]
const            -1.7968    nan     nan     nan     nan     nan
job_management   -0.0390    nan     nan     nan     nan     nan
job_technician   -0.2882    nan     nan     nan     nan     nan
job_entrepreneur -0.6092    nan     nan     nan     nan     nan
job_blue-collar  -0.7484    nan     nan     nan     nan     nan
job_unknown      -0.2142    nan     nan     nan     nan     nan
job_retired       0.5766    nan     nan     nan     nan     nan
job_admin.       -0.1766    nan     nan     nan     nan     nan
job_services     -0.5312    nan     nan     nan     nan     nan
job_self-employed   -0.2106     nan     nan     nan     nan     nan
job_unemployed  0.1011  nan     nan     nan     nan     nan
job_housemaid   -0.5427     nan     nan     nan     nan     nan
job_student     0.8857  nan     nan     nan     nan     nan
If I use the statsmodels Logit without sm.add_constant, I get coefficients very different from the sklearn logistic regression ones, but I do get values for the z-scores (all negative).

The results are as follows:

Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 6

Model:  Logit   Pseudo R-squared:   0.023
Dependent Variable:     Target_yes  AIC:    31907.6785
Date:   2019-11-18 10:18    BIC:    32012.3076
No. Observations:   45211   Log-Likelihood:     -15942.
Df Model:   11  LL-Null:    -16315.
Df Residuals:   45199   LLR p-value:    3.9218e-153
Converged:  1.0000  Scale:  1.0000
No. Iterations:     6.0000      
                  Coef.     Std.Err.    z        P>|z|  [0.025  0.975]
job_management  -1.8357     0.0299  -61.4917    0.0000  -1.8943     -1.7772
job_technician  -2.0849     0.0366  -56.9885    0.0000  -2.1566     -2.0132
job_entrepreneur -2.4060    0.0941  -25.5563    0.0000  -2.5905     -2.2215
job_blue-collar  -2.5452    0.0390  -65.2134    0.0000  -2.6217     -2.4687
job_unknown      -2.0110    0.1826  -11.0120    0.0000  -2.3689     -1.6531
job_retired      -1.2201    0.0501  -24.3534    0.0000  -1.3183     -1.1219
job_admin.       -1.9734    0.0425  -46.4478    0.0000  -2.0566     -1.8901
job_services     -2.3280    0.0545  -42.6871    0.0000  -2.4349     -2.2211
job_self-employed -2.0074   0.0779  -25.7739    0.0000  -2.1600     -1.8547
job_unemployed   -1.6957    0.0765  -22.1538    0.0000  -1.8457     -1.5457
job_housemaid    -2.3395    0.1003  -23.3270    0.0000  -2.5361     -2.1429
job_student      -0.9111    0.0722  -12.6195    0.0000  -1.0526     -0.7696
Which of the two is better? Or should I use a completely different approach?

"I fitted this with a sklearn logistic regression model and obtained the coefficients. Since I could not interpret them, I looked for an alternative and came across the statsmodels version."

The interpretation works as follows: exponentiating the log-odds gives the odds ratio for a one-unit increase in the variable. So, for example, if Target_yes (1 = customer accepted the term deposit, 0 = customer did not) is the outcome and the logistic regression coefficient is 0.573717, you can assert that the odds of your "accepted" outcome are exp(0.573717) = 1.7748519304802 times the odds of your "did not accept" outcome.
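In code, using the numbers quoted from the question's output:

```python
import numpy as np

# Exponentiating a log-odds coefficient gives an odds ratio
coef_retired = 0.573717          # positive coefficient from the question
odds_ratio = np.exp(coef_retired)
print(round(odds_ratio, 4))      # ~1.7749: odds of "accept" scale by this factor

# A negative coefficient gives an odds ratio below 1, i.e. that
# job type lowers the odds of acceptance by the same logic
coef_other = -0.539109
print(round(np.exp(coef_other), 4))   # ~0.5833
```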


Comments on the answer:

- My guess is that you have a singular design matrix due to the dummy variable trap when you add the constant.
- I will definitely look into what that is and how to handle it, thanks for your time! And thank you so much for clarifying this: so the odds of accepting are 1.77485 times the odds of not accepting. In what order are these coefficients; do they match the input variables? And what does exp of a negative coefficient give? exp(-0.539109) = 0.5832677124519539; is that the odds of not being accepted?
- Negative and positive coefficients get the same interpretation.
- Thank you for taking the time to explain this to me!
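The dummy variable trap mentioned in the comments can be checked directly: with all k job dummies, the columns sum to 1 in every row, so adding a constant column makes the design matrix rank-deficient, which is what produces the nan standard errors. A sketch on made-up job values:

```python
import numpy as np
import pandas as pd

jobs = pd.Series(["management", "technician", "student", "retired"] * 25)

# All k dummies: every row sums to 1, so a constant column is a
# linear combination of the dummies and the matrix is singular
full = pd.get_dummies(jobs, prefix="job").astype(float)
with_const = np.column_stack([np.ones(len(full)), full.values])
print(np.linalg.matrix_rank(with_const))   # 4, although there are 5 columns

# Dropping one dummy (k-1 encoding) restores full column rank
reduced = pd.get_dummies(jobs, prefix="job", drop_first=True).astype(float)
with_const2 = np.column_stack([np.ones(len(reduced)), reduced.values])
print(np.linalg.matrix_rank(with_const2))  # 4 of 4 columns: full rank
```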