Python recursive formula is slow in a loop; is there any way to make this code run faster?

python pandas loops for-loop

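I have the following dataset:

The formula for calculating the hazard rate is:

Year = 1: hazard_rate(Year) = PD(Year)

Year > 1: hazard_rate(Year) = (PD(Year) + hazard_rate(Year - 1) * (Year - 1)) / (Year)

Assumption: within each customer ID, Year is monotonic and strictly > 0.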
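Multiplying the Year > 1 rule through by Year shows why the recursion can be unrolled. This is only a derivation sketch: it assumes each customer's years run 1, 2, ..., n with no gaps, and the shorthand h_n for hazard_rate(Year = n) and p_n for PD(Year = n) is introduced here purely for brevity.

```latex
n\,h_n = p_n + (n-1)\,h_{n-1}, \qquad 1 \cdot h_1 = p_1
\;\Longrightarrow\;
n\,h_n = \sum_{k=1}^{n} p_k
\;\Longrightarrow\;
h_n = \frac{1}{n} \sum_{k=1}^{n} p_k
```

Under that assumption, the hazard rate in year n is the running mean of the PDs up to and including year n.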

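Because the formula is recursive and needs the previous year's hazard rate, my code below is slow and cannot cope with large datasets. Is there a way to vectorize this operation, or at least make the loop faster?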

#Calculate the hazard rates
#Initialise an array to collect the hazard rate for each calculation, particularly useful for the recursive nature 
#of the formula
hr = []

#Loop through the dataframe, executing the hazard rate formula
#(rows are assumed to be sorted by customer ID and then by Year, with no gaps)
for index, row in df.iterrows():
    #If time_period (year) = 1 then the hazard rate is equal to the pd
    if row["Year"] == 1:
        hr.append(row["PD"])
    elif row["Year"] > 1:
        #The previous row holds the same customer's hazard rate for Year - 1
        hr.append((row["PD"] + hr[-1] * (row["Year"] - 1)) / (row["Year"]))
    else:
        raise ValueError("Index contains negative or zero values")

#Attach the hazard_rates array to the dataframe
df["hazard_rate"] = hr

This function computes the n-th hazard rate:

computed = {1: 0.05}
def func(n, computed = computed):
    '''
    Parameters:
        @n: int, year number
        @computed: dictionary with hazard rate already computed
    Returns:
        computed[n]: n-th hazard rate
    '''

    if n not in computed:
        computed[n] = (df.loc[n,'PD'] + func(n-1, computed)*(n-1))/n

    return computed[n]

Now let's compute the hazard rate for each year:

df.set_index('year', inplace=True)
df['Hazard_rate'] = [func(i) for i in df.index]

Note that the function does not care whether the dataframe is sorted by year, but I assume that the dataframe is indexed by year.

If you want the column back, just reset the index:

df.reset_index(inplace=True)
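The steps above can be run end to end on a small frame. This is a minimal sketch, not the answer's exact code: the PD values are invented for illustration, the function is renamed `hazard_rate`, and the cache is passed explicitly instead of living in a default argument.

```python
import pandas as pd

# Toy data: the PD values are invented purely for illustration.
df = pd.DataFrame({"year": [1, 2, 3], "PD": [0.05, 0.07, 0.09]})
df = df.set_index("year")


def hazard_rate(n, pd_by_year, computed):
    """Memoised recursion: each year is computed at most once."""
    if n not in computed:
        prev = hazard_rate(n - 1, pd_by_year, computed)
        computed[n] = (pd_by_year.loc[n] + prev * (n - 1)) / n
    return computed[n]


# Base case: hazard_rate(1) = PD(1)
computed = {1: df.loc[1, "PD"]}
df["Hazard_rate"] = [hazard_rate(i, df["PD"], computed) for i in df.index]

# Turn the year index back into a regular column
df = df.reset_index()
print(df)
```

With these made-up inputs the three hazard rates come out as approximately 0.05, 0.06 and 0.07.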

With the introduction of Customer_ID, the process becomes more complex:

#Function depends upon dataframe passed as argument
def func(df, n, computed):

    if n not in computed:
        #Pass the dataframe along in the recursive call as well
        computed[n] = (df.loc[n,'PD'] + func(df, n-1, computed)*(n-1))/n

    return computed[n]

#Set index
df.set_index('year', inplace=True)

#Initialize Hazard_rate column (as float, so the rates are not truncated)
df['Hazard_rate'] = 0.0

#Iterate over each customer once
for c in df['Customer_ID'].unique():

    #Create a customer mask
    c_mask = (df['Customer_ID'] == c)

    #Rows belonging to this customer only, indexed by year
    c_df = df.loc[c_mask]

    # Initialize computed dictionary for given customer
    c_computed = {1: c_df.loc[1,'PD']}

    #Assign with a single .loc call; chained indexing would write to a copy
    df.loc[c_mask, 'Hazard_rate'] = [func(c_df, i, c_computed) for i in c_df.index]
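For large datasets the per-customer loop can in principle be avoided altogether. Unrolling the formula gives year * hazard_rate(year) = PD(1) + ... + PD(year), so the hazard rate is a per-customer cumulative sum of PD divided by the year. The sketch below is not the method used in the answer above. It assumes that every customer's years start at 1 and have no gaps, and the customer IDs and PD values are invented for illustration.

```python
import pandas as pd

# Toy data: customer IDs and PD values are invented for illustration.
df = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B"],
    "year": [1, 2, 3, 1, 2],
    "PD": [0.05, 0.07, 0.09, 0.10, 0.20],
})

# Assumption: within each customer the years are 1, 2, 3, ... with no gaps.
# Sorting makes the cumulative sum run in year order for every customer.
df = df.sort_values(["Customer_ID", "year"])

# Running sum of PD per customer, divided by the year, is the running mean,
# which equals the recursive hazard rate under the assumption above.
df["Hazard_rate"] = df.groupby("Customer_ID")["PD"].cumsum() / df["year"]
print(df)
```

If the years can have gaps or do not start at 1, this shortcut no longer matches the recursive definition and the loop-based versions should be used instead.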

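Just to clarify: is the dataset you show at the start what you want to compute, so that your dataframe only has the year and PD columns to begin with?

Would it help to do df.loc[index, 'hazard_rate'] = *formula result* instead of using a list?

FBruzzesi, correct. I added the "hazard rate" column so that people can validate their results. I have tried using .loc in the past, but since the formula needs the previous result I could not get it to work. Could you show me? The data will be sorted by year, with strictly no gaps between years and strictly no zero or negative years, since these are forecast years.

Since you have now introduced a new variable, Customer_ID, the code above will not work as expected.

Without having checked it myself, it looks like your function is much worse than the OP's, because you recompute the whole path from scratch for every year, whereas he uses the previous year's already computed result.

I can run it on one ID and then iterate over each ID.

@Aryerez Once a year has been computed, it is not computed again. This is a classic recursive approach (for example, it is the fastest way to compute Fibonacci numbers in Python, see).

@78282219 Just take care to re-initialize the function in each loop, since the function also initializes the dictionary that stores what has already been computed.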
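The disagreement about recomputation can be checked by counting function calls. This is a minimal sketch under stated assumptions: the PD values are invented, a plain dict stands in for the dataframe, and the `stats` counter is a hypothetical helper added only for this illustration.

```python
# Hypothetical call counter, used only to illustrate the caching behaviour.
stats = {"calls": 0}


def hazard_rate(n, pd_values, computed):
    """Memoised recursion over years; `computed` caches finished years."""
    stats["calls"] += 1
    if n not in computed:
        prev = hazard_rate(n - 1, pd_values, computed)
        computed[n] = (pd_values[n] + prev * (n - 1)) / n
    return computed[n]


# Invented PD values for years 1..4
pd_values = {1: 0.05, 2: 0.07, 3: 0.09, 4: 0.11}

# A fresh cache per customer, seeded with the base case hazard_rate(1) = PD(1)
computed = {1: pd_values[1]}
rates = [hazard_rate(n, pd_values, computed) for n in pd_values]

print(stats["calls"])  # 7 calls for 4 years: one for year 1, two for each later year
```

Each year after the first costs two calls (the year itself plus one cache hit for the previous year), so the work grows linearly with the number of years as long as the cache is reused within a customer and created afresh for the next one.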