Python pandas DataFrame custom forward fill optimization

python python-3.x performance pandas

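Dear all,
I have a function that synchronizes a pandas DataFrame according to certain specified columns.
The function works, but I would like to know how to:

improve its performance
make it more pythonic

Please feel free to leave any suggestions.
Thanks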

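Function, sample and results

Function specification

Inputs:

df: a DataFrame with columns:
    [a0, ..., aN]: the names a0 to aN can be any valid strings, and the columns contain numeric values
    [agent, date]: fixed names; agent contains numeric values and date contains datetimes
sync_with: the columns to synchronize with (a string or a list of strings contained in [a0, ..., aN]) or, by default, an empty list to synchronize all of [a0, ..., aN]

Synchronize:

Forward fill, grouped by agent value
Drop the rows where all of the columns to synchronize with are empty

Returns: the synchronized DataFrame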

Here is my function:
import pandas as pd
import numpy as np

def synchronize(df,sync_with=[]):
    _df = df.copy()

    if not isinstance(sync_with,list):
        sync_with = [sync_with]

    _fixed_cols = ['date','agent']
    _fixed_cols.extend(sync_with)
    _colset = [c for c in _df.columns if c not in _fixed_cols]

    for ag in _df.agent.unique():
        _df.loc[_df.agent==ag,_colset] = _df.loc[_df.agent==ag,_colset].fillna(method='ffill')
        if sync_with:
            _df = _df.dropna(how='all', subset=sync_with)
            _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].fillna(method='ffill')

    return _df

Sample and results:
foo = pd.DataFrame(dict(date=pd.to_datetime(['2010', '2011', '2012', '2013', '2010', '2013', '2015', '2016']),
                        agent=[1,1,1,1,2,2,2,2],
                        _a=[1, np.nan, np.nan, 4, 5, np.nan, 7, 8],
                        _b=[11, 22, np.nan, np.nan, 55, np.nan, 77, np.nan],
                        _c=[111, np.nan, 333, np.nan, np.nan, 666, 777, np.nan]))

# 1. default (10.1 ms per loop)
print(synchronize(foo))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
2  1.0  22.0  333.0      1 2012-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01

# 2. sync with one column (54.9 ms per loop)
print(synchronize(foo,'_c'))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
2  1.0  22.0  333.0      1 2012-01-01
5  NaN   NaN  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01

# 3. sync with two columns (53.4 ms per loop)
print(synchronize(foo,['_a','_b']))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
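
For comparison, here is a minimal sketch of how the specification above might be written without the Python loop over agents. It assumes the two steps are applied in the order written: forward fill every value column per agent first, then drop the rows in which all sync_with columns were empty. The name synchronize_vectorized is hypothetical and not part of the original function. Note that this is not a drop-in replacement. The original loop drops rows before later agents are filled, so for the '_c' sample above this sketch fills row 5 from the dropped row 4 (5.0 and 55.0), whereas the posted output shows NaN there. The default call and the ['_a','_b'] call give the same values as the results shown above.

```python
import numpy as np
import pandas as pd


def synchronize_vectorized(df, sync_with=None):
    """Hypothetical variant: grouped forward fill, then drop unsynchronized rows."""
    # Normalize sync_with to a list; None or an empty list means "no rows dropped".
    if sync_with is None:
        sync_with = []
    elif isinstance(sync_with, str):
        sync_with = [sync_with]
    else:
        sync_with = list(sync_with)

    # Every column except the two fixed ones is treated as a value column.
    value_cols = [c for c in df.columns if c not in ("date", "agent")]

    out = df.copy()
    # A single grouped forward fill replaces the loop over agents.
    out[value_cols] = df.groupby("agent")[value_cols].ffill()

    if sync_with:
        # Keep rows where at least one sync column held a value before filling.
        keep = df[sync_with].notna().any(axis=1)
        out = out[keep]
    return out


# Same sample data as in the question.
foo = pd.DataFrame(dict(
    date=pd.to_datetime(['2010', '2011', '2012', '2013', '2010', '2013', '2015', '2016']),
    agent=[1, 1, 1, 1, 2, 2, 2, 2],
    _a=[1, np.nan, np.nan, 4, 5, np.nan, 7, 8],
    _b=[11, 22, np.nan, np.nan, 55, np.nan, 77, np.nan],
    _c=[111, np.nan, 333, np.nan, np.nan, 666, 777, np.nan]))

print(synchronize_vectorized(foo))
print(synchronize_vectorized(foo, '_c'))
print(synchronize_vectorized(foo, ['_a', '_b']))
```

The row mask is computed on the unfilled data, so a row is dropped only when its sync_with columns were all empty to begin with. Whether the fill-then-drop order or the behavior of the original loop is the intended one is an open question that this sketch does not settle.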