Python performance improvement and coding style


Question

Assume the following sparse table is given, representing a list of securities that are members of an index:

identifier    from        thru
AAPL          1964-03-31  --
ABT           1999-01-03  2003-12-31
ABT           2005-12-31  --
AEP           1992-01-15  2017-08-31
KO            2014-12-31  --
For example, ABT is part of the index from 1999-01-03 to 2003-12-31 and again from 2005-12-31 until today (-- denotes today). In the time in between it is not listed on the index.

How can this sparse table be transformed efficiently into a dense table of the following form?

date          AAPL  ABT  AEP  KO
1964-03-31     1     0    0    0
1964-04-01     1     0    0    0
...           ...   ...  ...  ...
1999-01-03     1     1    1    0
1999-01-04     1     1    1    0
...           ...   ...  ...  ...
2003-12-31     1     1    1    0
2004-01-01     1     0    1    0
...           ...   ...  ...  ...
2017-09-04     1     1    0    1
In the section My solution below you will find my solution to the problem. Unfortunately, the performance of the code seems to be very poor: processing 1648 entries took roughly 22 seconds.

Since I am new to Python, I would like to know how to program such problems efficiently.

I do not expect anyone to provide a solution to my problem (unless you would like to do so). My main goal is to understand how to solve similar problems efficiently in Python. I used pandas functionality to match the corresponding entries. Should I use NumPy and indexing instead? Should I use another toolbox? How can I improve performance?

Please find my approach to the problem in the section below, in case you are interested.

Many thanks for your help.


My solution

I tried to solve the problem by looping over every row of the first table. In each iteration I build a boolean frame for the specific from-thru interval, with all elements set to True. This frame is appended to a list. Finally, I pd.concat the list, unstack it and reindex the resulting DataFrame.

import pandas as pd
import numpy as np

def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
    """
    Transform sparse table to dense table.

    Parameters
    ----------
    data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
    start_date: pd.Timestamp, str
        start date of the dense matrix
    end_date: pd.Timestamp, str
        end date of the dense matrix
    attribute: str
        column name of the value of the dense matrix.
    identifier: str
        column name of the identifier
    frequency: str
        frequency of the dense matrix
    Returns
    -------
    pd.DataFrame
        dense table indexed by the requested date range, with one column per identifier
    """

    if attribute is None:
        attribute = ['on_index']
    elif not isinstance(attribute, list):
        attribute = [attribute]

    if identifier is None:
        identifier = ['identifier']
    elif not isinstance(identifier, list):
        identifier = [identifier]

    if frequency is None:
        frequency = 'B'

    # copy data so that the input frame is not modified
    data_mod = data.copy()
    data_mod['on_index'] = True

    # specify start date and check type
    if not isinstance(start_date, pd.Timestamp):
        start_date = pd.Timestamp(start_date)

    # specify end date and check type
    if not isinstance(end_date, pd.Timestamp):
        end_date = pd.Timestamp(end_date)

    # specify output date range
    date_range = pd.date_range(start_date, end_date, freq=frequency)

    # fill missing 'thru' dates (valid until today) with the later of 'from' and end_date
    missing = data_mod['thru'].isnull()
    data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))

    # preallocate frms
    frms = []

    # add dataframe to frms with time specific entries
    for index, row in data_mod.iterrows():
        # date range index
        d_range = pd.date_range(row['from'], row['thru'], freq=frequency)

        # Multi index with date and identifier
        d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]], names=['date'] + identifier)

        # add DataFrame with repeated values to list
        frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size), index=d_index, columns=attribute))

    out_frame = pd.concat(frms)
    out_frame = out_frame.unstack(identifier)
    out_frame = out_frame.reindex(date_range)

    return out_frame

if __name__ == "__main__":
    data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                         'from': [pd.Timestamp('1964-03-31'),
                                  pd.Timestamp('1999-01-03'),
                                  pd.Timestamp('2005-12-31'),
                                  pd.Timestamp('1992-01-15'),
                                  pd.Timestamp('2014-12-31')],
                         'thru': [np.nan,
                                  pd.Timestamp('2003-12-31'),
                                  np.nan,
                                  pd.Timestamp('2017-08-31'),
                                  np.nan]
                         })

    transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04', attribute='on_index', identifier='identifier', frequency='B')
    print(transformed_data)
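
One simple way to measure such runtimes, for example the roughly 22 seconds mentioned above, is to wrap the call in a wall-clock timer. The helper below is only an illustrative sketch of mine (the name time_call is not from the original post):

import time

def time_call(func, *args, **kwargs):
    # Run func once and print the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(f"{func.__name__}: {time.perf_counter() - start:.2f} s")
    return result

# Example usage with the function above:
# transformed_data = time_call(get_ts_data, data,
#                              start_date='1964-03-31', end_date='2017-09-04',
#                              attribute='on_index', identifier='identifier',
#                              frequency='B')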
Thank you, @Alexander, I appreciate it. This works nicely; in my example your solution below is roughly 76 times faster.
import numpy as np
import pandas as pd

# df is the sparse table from the question; '--' marks open-ended memberships.

# Ensure dates are Pandas timestamps.
df['from'] = pd.DatetimeIndex(df['from'])
df['thru'] = pd.DatetimeIndex(df['thru'].replace('--', np.nan))

# Get sorted list of all unique dates and create an index for the full business-day range.
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist()))
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')

# Create new target dataframe based on symbols and full date range.  Initialize to zero.
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti)

# Find all active symbols and set their symbols' values to one from their respective `from` dates.
for _, row in df[df['thru'].isnull()].iterrows():
    df2.loc[df2.index >= row['from'], row['identifier']] = 1

# Find all other symbols and set their symbols' values to one between their respective `from` and `thru` dates.
for _, row in df[df['thru'].notnull()].iterrows():
    df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1

>>> df2.head(3)
            AAPL  ABT  AEP  KO
1964-03-31     1    0    0   0
1964-04-01     1    0    0   0
1964-04-02     1    0    0   0

>>> df2.tail(3)
            AAPL  ABT  AEP  KO
2017-08-29     1    1    1   1
2017-08-30     1    1    1   1
2017-08-31     1    1    1   1

>>> df2.loc[:'2004-01-02', 'ABT'].tail()
2003-12-29    1
2003-12-30    1
2003-12-31    1
2004-01-01    0
2004-01-02    0
Freq: B, Name: ABT, dtype: int64

>>> df2.loc['2005-12-30':, 'ABT'].head(3)
2005-12-30    0
2006-01-02    1
2006-01-03    1
Freq: B, Name: ABT, dtype: int64
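
Regarding the question above about using NumPy and indexing: the membership matrix can also be built fully vectorized by broadcasting the business-day index against the from/thru columns. The sketch below is my own illustration under the same assumptions as above (a sparse df with '--' marking open-ended rows), not code from the original thread:

import numpy as np
import pandas as pd

# Sparse membership table as in the question; '--' marks open-ended rows.
df = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                   'from': ['1964-03-31', '1999-01-03', '2005-12-31', '1992-01-15', '2014-12-31'],
                   'thru': ['--', '2003-12-31', '--', '2017-08-31', '--']})

end = pd.Timestamp('2017-09-04')
frm = pd.DatetimeIndex(df['from'])
thru = pd.DatetimeIndex(df['thru'].replace('--', np.nan)).fillna(end)

dates = pd.date_range(frm.min(), end, freq='B')

# Broadcast: one boolean row per business day, one column per sparse-table row.
active = (dates.values[:, None] >= frm.values) & (dates.values[:, None] <= thru.values)

# Collapse multiple intervals per identifier (e.g. the two ABT rows) with groupby(...).max().
dense = (pd.DataFrame(active.astype(int), index=dates, columns=df['identifier'])
         .T.groupby(level=0).max().T)

print(dense.loc[:'2004-01-02', 'ABT'].tail())

For a table of a few thousand rows the intermediate boolean matrix (business days x rows) should still fit comfortably in memory, and the approach avoids building one small DataFrame per row as in the original loop.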