How do I fix this convergence error? Python 3 statsmodels

python pandas numpy matplotlib


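This question concerns Python 3 statsmodels and its generalized linear model (GLM) class.

Whenever my endogenous variable contains values that differ by more than an order of magnitude, the GLM fails to converge and raises an exception. Here is a code example of what I mean: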

import pandas as pd
import numpy as np
import statsmodels.api as sm

col = ["a", "b", "c", "d", "e", "f", "g", "h"]

df = pd.DataFrame(np.random.randint(low=1, high=100, size=(20, 8)), columns=col)
df["a"] = [0.01] * 20
df2 = pd.DataFrame(np.random.randint(low=1, high=100, size=(20, 8)), columns=col)
df2["a"] = np.random.randint(low=10000, high=99999, size=(20, 1))
df3 = pd.DataFrame(np.random.randint(low=1, high=100, size=(20, 8)), columns=col)
# Mostly 0.01, with two entries several orders of magnitude larger.
df3["a"] = [
    0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
    np.random.randint(low=10000, high=99999),
    0.01, 0.01, 0.01, 0.01, 0.01,
    np.random.randint(low=10000, high=99999),
    0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
]
try:
    actual = df[["a"]]

    fml1 = "a ~ log(b) + c + d + e + f + g"

    data1 = df[["b", "c", "d", "e", "f", "g"]]

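    # Note: the sm.GLM constructor takes endog/exog directly; formulas are
    # handled by GLM.from_formula, so `formula=fml1` does not transform data1 here.
    model = sm.GLM(actual, data1, formula=fml1, family=sm.families.Tweedie(link_power=1.1)).fit()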
    model_pred = model.predict()
    print("SUCCESS")
except Exception as exc:
    print("FAILURE:", exc)
try:
    actual = df2[["a"]]

    fml1 = "a ~ log(b) + c + d + e + f + g"

    data1 = df2[["b", "c", "d", "e", "f", "g"]]

    model = sm.GLM(actual, data1, formula=fml1, family=sm.families.Tweedie(link_power=1.1)).fit()
    model_pred = model.predict()
    print("SUCCESS")
except Exception as exc:
    print("FAILURE:", exc)
try:
    actual = df3[["a"]]

    fml1 = "a ~ log(b) + c + d + e + f + g"

    data1 = df3[["b", "c", "d", "e", "f", "g"]]

    model = sm.GLM(actual, data1, formula=fml1, family=sm.families.Tweedie(link_power=1.1)).fit()
    model_pred = model.predict()
    print("SUCCESS")
except Exception as exc:
    print("FAILURE:", exc)

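If you run this code, only the last data set should raise an exception. Why does this happen? How can I get the GLM to converge? Are there any alternatives?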
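To make "more than an order of magnitude" concrete, here is a minimal sketch that measures the spread of a positive sample as log10(max / min). The helper name `magnitude_spread` and the fixed sample values are illustrative assumptions that stand in for the random draws in the code above.

```python
import numpy as np


def magnitude_spread(values):
    """Return log10(max / min) for strictly positive values (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("values must be strictly positive")
    return float(np.log10(values.max() / values.min()))


# Fixed stand-ins for the three "a" columns (assumed values, not the random ones).
constant = [0.01] * 20                          # like df["a"]
large = np.linspace(10000, 99999, 20)           # like df2["a"]
mixed = [0.01] * 18 + [20000.0, 50000.0]        # like df3["a"]

for name, sample in [("constant", constant), ("large", large), ("mixed", mixed)]:
    print(f"{name}: {magnitude_spread(sample):.2f} orders of magnitude")
```

Only the mixed sample spans several orders of magnitude, which matches the case that fails.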

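It seems that fitting the parameters of a Tweedie distribution is not easy. In fact, a set of parameters is valid only if the dot product of the coefficients with every observation is positive, i.e. the linear predictor must be positive for all observations. Otherwise, the value used in the prediction is a negative real number, which cannot be raised to a non-integer power.

This constraint therefore has to hold at every iteration of most optimizers, and it is hard to maintain, especially when the data contain values of different orders of magnitude.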
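The positivity requirement can be illustrated with plain NumPy. This is a sketch under the assumption that the link is a power link of the form mu ** power = eta with power 1.1, matching `link_power=1.1` in the question; the function names `inverse_power_link` and `is_feasible` are hypothetical and not part of statsmodels.

```python
import numpy as np


def inverse_power_link(eta, power=1.1):
    """Map the linear predictor eta to the mean: mu = eta ** (1 / power)."""
    # NumPy returns nan for a negative base with a non-integer exponent.
    with np.errstate(invalid="ignore"):
        return np.power(eta, 1.0 / power)


def is_feasible(exog, params):
    """True if the linear predictor is positive for every observation."""
    return bool(np.all(exog @ params > 0))


exog = np.array([[1.0, 2.0],
                 [3.0, 1.0]])
good = np.array([0.5, 0.5])    # all linear predictors positive
bad = np.array([1.0, -2.0])    # first linear predictor is negative

mu_good = inverse_power_link(exog @ good)
mu_bad = inverse_power_link(exog @ bad)

print(is_feasible(exog, good), mu_good)
print(is_feasible(exog, bad), mu_bad)
```

A single nan in the predicted means is enough to break the next optimizer step, which is why one infeasible iterate can make the whole fit fail.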

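I see two main ways to deal with this problem.

The simplest one: force your coefficients to be positive. In your case all observations are positive, so you will stay in the feasible region. This can be done with the Newton solver and a callback, for example: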

    model = sm.GLM(actual, data1, formula=fml1,
                   family=sm.families.Tweedie(link_power=1.1))

    def callback(x):
        # Clip negative coefficients to zero, in place, after each step.
        x[x < 0] = 0

    result = model.fit(method='newton', disp=True, start_params=np.ones(6),
                       callback=callback)
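The clipping idea can be tried in isolation on a toy problem. The sketch below is not the statsmodels optimizer: `newton_with_callback` is a hypothetical, simplified Newton loop that calls the callback on each new iterate, and the quadratic objective is an assumption chosen so that the result is easy to check by hand.

```python
import numpy as np


def newton_with_callback(grad, hess, start, callback=None, maxiter=50, tol=1e-8):
    """Plain Newton iteration; the callback may modify each iterate in place."""
    new = np.asarray(start, dtype=float).copy()
    old = new + 1.0  # differs from `new` so the loop runs at least once
    iterations = 0
    while iterations < maxiter and np.any(np.abs(new - old) > tol):
        old = new
        new = old - np.linalg.solve(hess(old), grad(old))
        if callback is not None:
            callback(new)
        iterations += 1
    return new


def clip_negative(x):
    # Same in-place clipping as the callback above.
    x[x < 0] = 0


# Toy objective: 0.5 * ||x - target||^2, whose unconstrained minimum is `target`.
target = np.array([2.0, -1.0, 0.5])


def grad(x):
    return x - target


def hess(x):
    return np.eye(target.size)


result = newton_with_callback(grad, hess, np.ones(3), callback=clip_negative)
print(result)
```

For this separable quadratic, clipping gives the exact non-negative minimizer. In general, clipping after an unconstrained step is a heuristic that keeps iterates feasible and is not guaranteed to reach the constrained optimum.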

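PS: I am not an expert on the Tweedie distribution, but I have worked with other similar distributions, such as the Poisson distribution, which face the same problem.

I will award you the bounty once I have tried one of these solutions (tomorrow) and verified that it works.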

model = sm.GLM(actual, data1, formula=fml1,
               family=sm.families.Tweedie(link_power=1.1))
result = model.fit(method='cg', disp=True, start_params=0.1 * np.ones(6))