Python 3.x: Getting "PerfectSeparationError" when using the statsmodels.GLM() class for logistic regression


python-3.x machine-learning


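I am trying to fit a logistic regression model using the GLM class from the statsmodels library. I have a toy dataset with 1250 records and 8 independent variables.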

Predictor: ['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today']
train dataset shape: (1000, 9)    # 9 predictors including Intercept
Test dataset shape : (250, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       1250 non-null   int64  
 1   Lag1       1250 non-null   float64
 2   Lag2       1250 non-null   float64
 3   Lag3       1250 non-null   float64
 4   Lag4       1250 non-null   float64
 5   Lag5       1250 non-null   float64
 6   Volume     1250 non-null   float64
 7   Today      1250 non-null   float64
 8   Direction  1250 non-null   object 
dtypes: float64(7), int64(1), object(1)
memory usage: 88.0+ KB

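After splitting the data into training and test sets, fitting the model raises the error below, together with the RuntimeWarning "invalid value encountered in true_divide".
Here is the full error: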

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\family.py:894: RuntimeWarning: invalid value encountered in true_divide
  n_endog_mu = self._clean((1. - endog) / (1. - mu))
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\links.py:188: RuntimeWarning: overflow encountered in exp
  t = np.exp(-z)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\family.py:893: RuntimeWarning: invalid value encountered in true_divide
  endog_mu = self._clean(endog / mu)
---------------------------------------------------------------------------
PerfectSeparationError                    Traceback (most recent call last)
<ipython-input-166-66b45a65beec> in <module>
      2
      3 glm4 = sm.GLM(endog = y4_train, exog = X4_train[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today']], family = sm.families.Binomial())
----> 4 glm4_result = glm4.fit()
      5 glm4_result.summary()

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, full_output, disp, max_start_irls, **kwargs)
   1025             return self._fit_irls(start_params=start_params, maxiter=maxiter,
   1026                                   tol=tol, scale=scale, cov_type=cov_type,
-> 1027                                   cov_kwds=cov_kwds, use_t=use_t, **kwargs)
   1028         else:
   1029             self._optim_hessian = kwargs.get('optim_hessian')

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in _fit_irls(self, start_params, maxiter, tol, scale, cov_type, cov_kwds, use_t, **kwargs)
   1172             if endog.squeeze().ndim == 1 and np.allclose(mu - endog, 0):
   1173                 msg = "Perfect separation detected, results not available"
-> 1174                 raise PerfectSeparationError(msg)
   1175             converged = _check_convergence(criterion, iteration + 1, atol,
   1176                                            rtol)

PerfectSeparationError: Perfect separation detected, results not available

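However, if I keep only 6 predictors or fewer, the fit() method works and I can view the summary(). As soon as I use more than 6, I keep getting this PerfectSeparationError.
So does the statsmodels.GLM() class accept only 6 predictors? That surely cannot be the case, otherwise what use would the class be! Most likely I have made a mistake in my code that I cannot spot. Can someone help me with the necessary correction?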

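If the Today field is a date and there is one record per date, perfect separation is unavoidable.

@DejaVuSansMono The predictor Today does not contain datetime records... its type is float64.

What does the Today feature represent?

Percentage return.

What I mean is that perfect separation occurs when all or nearly all of the values of one predictor are associated with only one outcome. In that case no solution can be found for that predictor's coefficient. You should cross-tabulate your predictors against the outcome and look for a cell with zero observations.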
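The cross-tabulation suggested above can be sketched roughly as follows. This is only an illustration on synthetic data, not the asker's dataset. It assumes that Direction is simply the sign of Today, which the thread does not confirm. The helper names sign_crosstab and separates_perfectly are made up for this sketch, and the check only looks at a split at zero, so it will miss separation at other thresholds or by combinations of predictors.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Lag1": rng.normal(0.0, 1.1, n),
    "Today": rng.normal(0.0, 1.1, n),
})
# ASSUMPTION: the outcome is just the sign of Today (not confirmed in the thread).
df["Direction"] = np.where(df["Today"] > 0, "Up", "Down")


def sign_crosstab(frame, predictor, outcome="Direction"):
    """Cross-tabulate the sign of one predictor against the outcome."""
    sign = np.where(frame[predictor] > 0, "positive", "non-positive")
    sign = pd.Series(sign, index=frame.index, name=predictor + " sign")
    return pd.crosstab(sign, frame[outcome])


def separates_perfectly(table):
    """True if every row of the table has at most one non-zero cell."""
    return bool(((table > 0).sum(axis=1) <= 1).all())


today_table = sign_crosstab(df, "Today")
lag1_table = sign_crosstab(df, "Lag1")

print(today_table)
print("Today separates the outcome:", separates_perfectly(today_table))
print("Lag1 separates the outcome:", separates_perfectly(lag1_table))
```

If a table like the one for Today shows zero counts in the off-diagonal cells, that predictor determines the outcome on its own, and dropping it from exog is the usual first thing to try.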