Python: optimizing the apply of a user-defined function that references multiple DataFrames

python pandas optimization

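I want to append a column of `Adj` values to each dataframe in `dflist`, based on the following logic:

- The `Adj` value is found by matching the max-le values (the largest value less than or equal to the input) of the `[FICO, LTV]` columns against the lookup dataframes (`afltv` and `gfltv`).
- The lookup dataframes are used conditionally: if the `Investor` column of the input row is `'a'`, `afltv` is used; otherwise `gfltv` is used.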

Sample data:

```python
import pandas as pd

afltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [10, 11, 12, 13]})
gfltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [1, 2, 3, 4]})
df = pd.DataFrame({'Investor': ['a', 'a', 'e', 'f'],
                   'FICO': [600, 699, 700, 701],
                   'LTV': [69, 70, 71, 90]})
df2 = pd.DataFrame({'Investor': ['a', 'a', 'e', 'f'],
                    'FICO': [600, 699, 700, 701],
                    'LTV': [69, 70, 71, 90]})
dflist = [df, df2]
```

Background: my current implementation takes about 3 hours to finish on 400k rows. It works as follows:

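1. Based on the `Investor` column, set the `table` variable. For example, if `df.Investor == 'a'`, then `table = afltv`.
2. Convert `df.FICO` and `df.LTV` (same for `df2`) to the closest values in `table.FICO` and `table.LTV`, respectively, without going over. For example, if `df.FICO` is 699 and the values in `table.FICO` are 0 and 700, the converted result should be 0.
3. Store the results of step 2 in the variables `cscore` and `lscore` (the process described in step 2 is applied to each variable).
4. Use `.loc` with the variables from step 3 to return a scalar value from the `table` variable set in step 1.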

The desired output is currently produced like this:

```python
def find_value(row):
    # Based on df.Investor (passed as row['Investor']), set `table` to one of
    # the lookup dataframes defined above; they hold the desired result
    # values in the 'Adj' column.
    if row['Investor'] == 'a':
        table = afltv.copy()
    else:
        table = gfltv.copy()

    # Convert FICO (as described in step 2) and store it in cscore.
    table.drop(table[table.FICO > row['FICO']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    cscore = table.loc[(table['FICO'] - row['FICO']).abs().argsort(), 'FICO'].values[0]

    # Convert LTV (as described in step 2) and store it in lscore.
    table.drop(table[table.LTV > row['LTV']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    lscore = table.loc[(table['LTV'] - row['LTV']).abs().argsort(), 'LTV'].values[0]

    # Use .loc with cscore and lscore to pull the matching value out of
    # table['Adj']; [0] turns the one-element array into a scalar.
    adj = table.loc[(table['LTV'] == lscore) & (table['FICO'] == cscore), 'Adj'].values[0]

    return adj
```

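I also have several other functions that work in a similar way, taking a row and returning a scalar value. I tried passing whole column Series into the functions to speed things up, but they contain comparisons that return booleans, so that does not seem to work.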
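The boolean-comparison problem mentioned above can be reproduced on a tiny example. This is only a sketch with made-up names (`investor`, `mask`, `table_name`): comparing a whole Series yields a boolean Series, which a plain `if` cannot evaluate, while `np.where` applies the same choice element by element.

```python
import numpy as np
import pandas as pd

investor = pd.Series(['a', 'a', 'e', 'f'])
mask = investor == 'a'  # a boolean Series, one entry per row

# A plain `if` needs a single True/False, so a Series raises ValueError.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# np.where makes the same decision for every row at once.
table_name = np.where(mask, 'afltv', 'gfltv')
print(raised)       # True
print(table_name)   # ['afltv' 'afltv' 'gfltv' 'gfltv']
```

The row-wise `if`/`else` in `find_value` therefore has to become a mask or an `np.where` call before the function can take whole columns.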

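I am looking for any suggestions for improvement. I have a few questions about optimization:

1. Is it bad practice to loop over the dataframes and apply the function to each one? Should I combine them into one df and apply once?
2. Is it inefficient to create new dataframes and run `.loc` inside the function that I apply?
3. Following on from question 2, would it make more sense to convert the values I use for `.loc` (`FICO` and `LTV`) inside the function but skip the `.loc` part? I could do a merge outside the function instead of `.loc`.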
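One possible reading of question 3, sketched under assumptions: floor `FICO` and `LTV` to the lookup table's breakpoints in bulk, then fetch `Adj` with a single merge instead of a per-row `.loc`. The helper `floor_to` and the names `keys`, `merged` and `result` are hypothetical, and only the non-`'a'` rows with `gfltv` are shown to keep it short.

```python
import numpy as np
import pandas as pd

gfltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [1, 2, 3, 4]})
df = pd.DataFrame({'Investor': ['e', 'f'],
                   'FICO': [700, 701],
                   'LTV': [71, 90]})

def floor_to(breakpoints, values):
    """Return, for each value, the largest breakpoint <= that value."""
    bp = np.unique(breakpoints.to_numpy())  # sorted unique breakpoints
    pos = np.searchsorted(bp, values.to_numpy(), side='right') - 1
    return bp[pos]

# Step 2 of the question, done for whole columns at once.
keys = pd.DataFrame({'FICO': floor_to(gfltv['FICO'], df['FICO']),
                     'LTV': floor_to(gfltv['LTV'], df['LTV'])})

# The merge replaces the per-row .loc lookup; how='left' keeps row order.
merged = keys.merge(gfltv, on=['FICO', 'LTV'], how='left')
result = df.assign(Adj=merged['Adj'].to_numpy())
print(result)
```

This sketch assumes every input value is at least as large as the smallest breakpoint; a smaller value would produce position -1 and silently pick the largest breakpoint.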

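Key points

1. The key among the key points is `np.searchsorted`, which solves the "max le (less than or equal to)" logic perfectly. According to the documentation, the function is also vectorized.
2. The `*fltv` dataframes are really "flattened" 2D tables of (`FICO` x `LTV`). If they are unstacked back into 2D form, the `Adj` value can be found directly using the coordinates from 1.
3. NumPy integer array indexing is used to fetch the values in 2.

The code finishes 800k rows in 0.53 seconds (400k rows each in `df` and `df2`).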
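The "max le" behaviour can be checked on a tiny example before reading the full code. This is a minimal sketch with made-up names (`breakpoints`, `values`); it assumes the breakpoints are sorted ascending and that no value is below the smallest breakpoint.

```python
import numpy as np

breakpoints = np.array([0, 700])           # sorted lookup keys
values = np.array([600, 699, 700, 701])    # inputs to floor

# side='right' returns the insertion point after any equal entries,
# so subtracting 1 gives the position of the largest breakpoint <= value.
pos_right = np.searchsorted(breakpoints, values, side='right') - 1
floored_right = breakpoints[pos_right]
print(floored_right)   # [  0   0 700 700]

# side='left' would treat an exact match (700) as "not yet reached".
pos_left = np.searchsorted(breakpoints, values, side='left') - 1
floored_left = breakpoints[pos_left]
print(floored_left)    # [  0   0   0 700]
```

The difference at the exact match (700) is why `side="right"` matters for a "less than or equal to" rule.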

Code

```python
import pandas as pd
import numpy as np

# Use *fltv, df, df2, dflist as given

#  cond =  0    , 1
ls_fltv = [afltv, gfltv]

# construct lookup tables as 2D (FICO x LTV) arrays
ls_tb = [fltv.sort_values(["FICO", "LTV"])
             .set_index(["FICO", "LTV"])["Adj"]
             .unstack(level=-1) for fltv in ls_fltv]

def search_subdf(df, cond):
    """Search on a subset of df based on condition"""

    # get df subset
    if cond == 0:
        df_sub = df[df['Investor'] == 'a']
    elif cond == 1:
        df_sub = df[df['Investor'] != 'a']

    # get lookup table
    tb = ls_tb[cond]

    # Search FICO in the index and LTV in the columns
    FICO_where = np.searchsorted(tb.index.values, df_sub["FICO"].values, side="right") - 1
    LTV_where = np.searchsorted(tb.columns.values, df_sub["LTV"].values, side="right") - 1

    # append the Adj column
    return df_sub.assign(Adj=tb.values[FICO_where, LTV_where])

def search(df, n_cond):
    """Search through all conditions"""
    # compute the results from cond 0 and 1, and concat vertically
    return pd.concat([search_subdf(df, i) for i in range(n_cond)])

# execute
for i, item in enumerate(dflist):
    dflist[i] = search(item, len(ls_fltv))
```

Output

```python
for item in dflist:
    print(item)
    print()
```

```
  Investor  FICO  LTV  Adj
0        a   600   69   10
1        a   699   70   11
2        e   700   71    4
3        f   701   90    4

  Investor  FICO  LTV  Adj
0        a   600   69   10
1        a   699   70   11
2        e   700   71    4
3        f   701   90    4
```

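Notes

Here are my suggestions on optimization:

- Use vectorized operations on `numpy` arrays when efficiency is a concern.
  - Don't reinvent the wheel. If the logic does not seem very unusual, go and look for an existing wheel.
- Avoid explicit iteration over dataframes.
  - If a `for` loop or `.apply` is unavoidable (they are similar in efficiency), try to work on arrays rather than dataframes.
  - Don't use `.iterrows()`, `.itertuples()` or `.append()`.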
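As one illustration of the first suggestion, the per-investor split can itself be vectorized by stacking the lookup grids into a 3D array and indexing it with three integer arrays. This is a sketch, not the answer's code: it assumes both lookup tables share the same `FICO` and `LTV` breakpoints (true for the sample data), and the names `to_grid`, `cube`, `which`, `row` and `col` are made up for the example.

```python
import numpy as np
import pandas as pd

afltv = pd.DataFrame({'FICO': [0, 0, 700, 700], 'LTV': [0, 70, 0, 70],
                      'Adj': [10, 11, 12, 13]})
gfltv = pd.DataFrame({'FICO': [0, 0, 700, 700], 'LTV': [0, 70, 0, 70],
                      'Adj': [1, 2, 3, 4]})
df = pd.DataFrame({'Investor': ['a', 'a', 'e', 'f'],
                   'FICO': [600, 699, 700, 701],
                   'LTV': [69, 70, 71, 90]})

def to_grid(fltv):
    """Reshape a flat lookup table into a sorted FICO x LTV grid."""
    grid = fltv.pivot(index='FICO', columns='LTV', values='Adj')
    return grid.sort_index().sort_index(axis=1)

grids = [to_grid(afltv), to_grid(gfltv)]
fico_bp = grids[0].index.to_numpy()
ltv_bp = grids[0].columns.to_numpy()

# Assumption: every grid uses the same breakpoints, so one cube fits all.
assert all(g.index.equals(grids[0].index) and
           g.columns.equals(grids[0].columns) for g in grids)
cube = np.stack([g.to_numpy() for g in grids])  # (table, FICO, LTV)

which = (df['Investor'].to_numpy() != 'a').astype(int)  # 0 = afltv, 1 = gfltv
row = np.searchsorted(fico_bp, df['FICO'].to_numpy(), side='right') - 1
col = np.searchsorted(ltv_bp, df['LTV'].to_numpy(), side='right') - 1

# One integer-array indexing step picks table, row and column together.
result = df.assign(Adj=cube[which, row, col])
print(result)
```

Because no rows are filtered out or concatenated, the original row order and index are preserved. Inputs below the smallest breakpoint would yield position -1 and wrap around to the last row or column, so they would need separate handling.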

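Could you provide sample data and expected output? Otherwise people will not be able to test.

Good point, I will gather sample data and update the post.

Sample data and expected output have been added; thank you for the feedback!

Could you explain in clear words the logic for obtaining the value column? For example, a well-defined list of rules or a sequence of steps for looking up the value. Don't be vague. Almost all of the terms, such as "match", "closest" and "the table I want to look at", are not clearly defined. I suggest that you do not expect potential answerers to infer the logic from lengthy code. Try to focus on defining the requirements for the output rather than describing the code itself (the undesired version).

Updated the question to make it clearer, thanks again for the feedback. The end of your comment implies that I do not want the code to behave the way it currently does, but it works fine right now and I get the correct results. I am looking for suggestions on how to optimize the code so that it runs faster. That is why I explained the logic behind what I did, in case there are more efficient ways to accomplish the same steps.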

Lookup tables

```python
for tb in ls_tb:
    print(tb)
    print()
```

```
LTV   0   70
FICO        
0     10  11
700   12  13

LTV   0   70
FICO        
0      1   2
700    3   4
```