Performance issue with DataFrame column rename and drop

Below is the line_profiler output for the function:

Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s

File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1068                                           @profile
  1069                                           def _rpt_join(dfa, dfb, join_type='inner'):
  1070                                               ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
  1071                                                   'join_type' can be 'inner' or 'outer'
  1072                                               '''
  1073                                           
  1074         2           56     28.0      0.0      try:    # ('STK_ID','RPT_Date') are normal column
  1075         2      2936668 1468334.0     43.7          rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
  1076                                               except: # ('STK_ID','RPT_Date') are index
  1077                                                   rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
  1078                                                   
  1079                                           
  1080         2           81     40.5      0.0      try: # handle 'STK_Name
  1081         2       426472 213236.0      6.3          name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
  1082                                                   
  1083                                                   
  1084         2       900584 450292.0     13.4          nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
  1085                                                   
  1086         2      1138140 569070.0     16.9          rst.STK_Name_x = nameseries
  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)
  1089                                               except:
  1090                                                   pass
  1091                                           
  1092         2           94     47.0      0.0      return rst
What surprised me are these two lines:

  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)

Why does a simple DataFrame column rename and drop cost 8.9% + 10.7% of the total time? After all, the merge itself only accounts for 43.7%, and rename/drop do not look like computation-intensive operations. How can this be improved?
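One plausible explanation is that `rename` and `drop` each return a new DataFrame, i.e. they copy the data. A hedged sketch of the copying path versus a copy-avoiding alternative, on a made-up frame whose shape and column names merely mimic the merge result (`STK_Name_x` / `STK_Name_y` suffixes):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged frame 'rst'; data is synthetic.
rst = pd.DataFrame(np.random.randn(100000, 3),
                   columns=['STK_Name_x', 'STK_Name_y', 'Value'])

# Copying path, as in the profiled code: each call builds a new frame.
out = rst.rename(columns={'STK_Name_x': 'STK_Name'}).drop(['STK_Name_y'], axis=1)

# Copy-avoiding alternative: relabel the columns index in place,
# then select the surviving columns in a single step.
rst.columns = ['STK_Name', 'STK_Name_y', 'Value']
out2 = rst[['STK_Name', 'Value']]
```

Reassigning `rst.columns` only replaces the column index object, so it does not touch the underlying data blocks; whether this is measurably faster on the real data would need profiling.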

`rename` takes an `inplace` parameter, so you could instead do `rst.rename(columns={'STK_Name_x': 'STK_Name'}, inplace=True)`. I'm not sure it will make a difference, but I would hope it helps; my thinking is that otherwise it has to copy the DataFrame, which could account for the cost, though that is just a guess. Could you post a dataset along with the code? Also, please post the pandas version. With an almost identical scenario I tried to optimize as much as possible. The pandas version is 0.11; I will post the dataset later when I'm back at the office.
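The commenter's suggestion can be sketched as follows. Note that `drop` gained its own `inplace` flag only in later pandas releases, so on an old version like 0.11 a safe in-place column deletion is `del` (the frame below is illustrative only):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the shape and column names are made up.
rst = pd.DataFrame(np.random.randn(10, 2),
                   columns=['STK_Name_x', 'STK_Name_y'])

# Mutate the existing frame instead of rebinding 'rst' to a fresh copy.
rst.rename(columns={'STK_Name_x': 'STK_Name'}, inplace=True)

# In-place column removal that works on old pandas as well.
del rst['STK_Name_y']
```

Whether `inplace=True` actually avoids the copy depends on the pandas version's internals, so this is a hypothesis to verify with line_profiler rather than a guaranteed win.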