Python 3.x: Getting "PerfectSeparationError" when using the statsmodels.GLM() class for logistic regression

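I am trying to fit a logistic regression model using the GLM class from the statsmodels library. I have a toy dataset with 1250 records and 8 independent variables.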

Predictor: ['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today']
train dataset shape: (1000, 9)    # 9 predictors including Intercept
Test dataset shape : (250, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       1250 non-null   int64  
 1   Lag1       1250 non-null   float64
 2   Lag2       1250 non-null   float64
 3   Lag3       1250 non-null   float64
 4   Lag4       1250 non-null   float64
 5   Lag5       1250 non-null   float64
 6   Volume     1250 non-null   float64
 7   Today      1250 non-null   float64
 8   Direction  1250 non-null   object 
dtypes: float64(7), int64(1), object(1)
memory usage: 88.0+ KB

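After splitting the data into training and test sets, fitting the model raises the error below, along with the runtime warning "invalid value encountered in true_divide".

Here is the full error: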

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\family.py:894: RuntimeWarning: invalid value encountered in true_divide
  n_endog_mu = self._clean((1. - endog) / (1. - mu))
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\links.py:188: RuntimeWarning: overflow encountered in exp
  t = np.exp(-z)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\family.py:893: RuntimeWarning: invalid value encountered in true_divide
  endog_mu = self._clean(endog / mu)
---------------------------------------------------------------------------
PerfectSeparationError                    Traceback (most recent call last)
<ipython-input-166-66b45a65beec> in <module>
      2 
      3 glm4 = sm.GLM(endog = y4_train, exog = X4_train[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today']], family = sm.families.Binomial())
----> 4 glm4_result = glm4.fit()
      5 glm4_result.summary()

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, full_output, disp, max_start_irls, **kwargs)
   1025             return self._fit_irls(start_params=start_params, maxiter=maxiter,
   1026                                   tol=tol, scale=scale, cov_type=cov_type,
-> 1027                                   cov_kwds=cov_kwds, use_t=use_t, **kwargs)
   1028         else:
   1029             self._optim_hessian = kwargs.get('optim_hessian')

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in _fit_irls(self, start_params, maxiter, tol, scale, cov_type, cov_kwds, use_t, **kwargs)
   1172             if endog.squeeze().ndim == 1 and np.allclose(mu - endog, 0):
   1173                 msg = "Perfect separation detected, results not available"
-> 1174                 raise PerfectSeparationError(msg)
   1175             converged = _check_convergence(criterion, iteration + 1, atol,
   1176                                            rtol)

PerfectSeparationError: Perfect separation detected, results not available

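However, if I keep 6 predictors or fewer, the fit() method works and I can view summary(). As soon as I go beyond 6, I keep getting this "PerfectSeparationError".

So does the statsmodels.GLM() class accept only 6 predictors? That seems unlikely to me, otherwise what would the class be good for? Most likely I have made a mistake somewhere in my code that I cannot spot. Can someone help me with the necessary corrections?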


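If the "Today" field is a date and there is one record per date, there is no way for the data not to be perfectly separated.

@DejaVuSansMono The predictor "Today" does not contain datetime records. Its type is float64.

What does the Today feature represent?

Percentage return.

What I mean is that perfect separation occurs when all, or nearly all, values of a predictor are associated with only one outcome. In that case no finite solution exists for that predictor's coefficient. You should cross-tabulate your predictor values against the outcome and look for cells with zero observations.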
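The check suggested in the last comment can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the asker's dataset: the helper name perfectly_separates and the generated arrays today, lag1 and direction are hypothetical, and the sketch assumes the outcome is derived from the sign of the percentage return, which the thread hints at but does not confirm.

```python
import numpy as np


def perfectly_separates(x, y):
    """Return True if a single threshold on x splits the two classes of y.

    x: numeric predictor values; y: binary outcome (0/1 or bool).
    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    neg, pos = x[~y], x[y]
    # Separation holds when the value ranges of the two classes do not overlap.
    return bool(neg.max() < pos.min() or pos.max() < neg.min())


# Synthetic stand-ins (assumption): the outcome is the sign of the return.
rng = np.random.default_rng(0)
today = rng.normal(size=200)          # plays the role of "Today"
lag1 = rng.normal(size=200)           # an unrelated predictor
direction = (today > 0).astype(int)   # 1 = up, 0 = down

print("today separates direction:", perfectly_separates(today, direction))
print("lag1 separates direction:", perfectly_separates(lag1, direction))
```

If a column is flagged like this on the real training data, leaving it out of exog is one way to let fit() converge. That would also be consistent with the error appearing only once the seventh predictor is added, rather than with any limit on the number of predictors in GLM.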