Python: resampling daily time-frame data into yearly data

Tags: python, pandas, vectorization, resampling, ohlc

Here is a small sample of my daily OHLC data, stored in df1:

date                open    close   high    low
2019-01-01 00:00:00 3700    3800    3806    3646
2019-01-02 00:00:00 3800    3857    3880    3750
2019-01-03 00:00:00 3858    3766    3863    3729
2019-01-04 00:00:00 3768    3791    3821    3706
2019-01-05 00:00:00 3789    3772    3839    3756
2019-01-06 00:00:00 3776    3988    4023    3747
2019-01-07 00:00:00 3985    3972    4018    3928

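I want to create a DataFrame (df2) that shows what the active year's candle looks like as the year progresses. The close is the current day's close, the high is the maximum from January 1 to the current day, the low is the minimum from January 1 to the current day, and the open is the open of the year. It should look like this: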

date                open    close   high    low
2019-01-01 00:00:00 3700    3800    3806    3646
2019-01-02 00:00:00 3700    3857    3880    3646
2019-01-03 00:00:00 3700    3766    3880    3646
2019-01-04 00:00:00 3700    3791    3880    3646
2019-01-05 00:00:00 3700    3772    3880    3646
2019-01-06 00:00:00 3700    3988    4023    3646
2019-01-07 00:00:00 3700    3972    4023    3646

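I would love to post some code, but I'm lost here. I thought resampling would help, but it just collapses the whole year into a single row. I suppose I could also solve this by iterating over each day and resampling, but I know that would slow the computation down considerably, so I'm hoping to see whether vectorization is possible. This is my first post, so please let me know if there are guidelines I should follow better.

---------------EDIT------------------

Here is my full code. The year version works, but the other time frames do not. The data comes from a public source, yfinance, so I hope the bad results are easy to reproduce.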

import pandas as pd
import yfinance as yf

#not working
def resample_active_week(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.isocalendar().week)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.isocalendar().week)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current week
    df2['open'] = df.groupby(df.index.isocalendar().week)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2
#not working    
def resample_active_month(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.month)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.month)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current month
    df2['open'] = df.groupby(df.index.month)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2

#not working
def resample_active_quarter(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.quarter)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.quarter)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current quarter
    df2['open'] = df.groupby(df.index.quarter)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2
#working
def resample_active_year(df):
    df2 = pd.DataFrame()
    
    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.year)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.year)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current year
    df2['open'] = df.groupby(df.index.year)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2

df = yf.download(tickers='BTC-USD', period = 'max', interval = '1d',auto_adjust = True)
df.rename(columns={'Open':'open', 'High':'high','Low':'low','Close':'close'}, inplace=True)
df = df.drop(['Volume'],axis=1)

df2 = resample_active_week(df)
df3 = resample_active_month(df)
df4 = resample_active_quarter(df)
df5 = resample_active_year(df)

with pd.ExcelWriter('ResampleOut.xlsx', engine="openpyxl", mode="w") as writer:
            df.to_excel(writer, sheet_name='df_original')
            df2.to_excel(writer, sheet_name='df2_week')
            df3.to_excel(writer, sheet_name='df3_month')
            df4.to_excel(writer, sheet_name='df4_quarter')
            df5.to_excel(writer, sheet_name='df5_year')
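
One possible reading of why the week, month and quarter versions drift while the year version holds: `df.index.month` only takes the values 1 to 12 (and `isocalendar().week` 1 to 53, `quarter` 1 to 4), so the same month of different years ends up in a single group, whereas `df.index.year` is unique per year. The sketch below illustrates this on made-up, steadily falling prices; the `demo` frame and its values are assumptions for illustration, not data from the question.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily series with steadily falling highs (made-up values).
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
demo = pd.DataFrame({"high": np.linspace(1000.0, 100.0, len(idx))}, index=idx)

# Month number alone: January 2019 and January 2020 share one group.
merged = demo.groupby(demo.index.month)["high"]
# Year plus month: every calendar month gets its own group.
separate = demo.groupby([demo.index.year, demo.index.month])["high"]

n_merged = merged.ngroups      # 12 groups for 24 calendar months
n_separate = separate.ngroups  # 24 groups

day = pd.Timestamp("2020-01-01")
leaked = merged.cummax().loc[day]    # running max carried over from January 2019
fresh = separate.cummax().loc[day]   # running max restarts on the first day of the month
```

If this is the cause, the first year of output would look right and later years would inherit highs, lows and opens from earlier years.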

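I was able to change the initial price and the bottom price in one go, but only the high price needed to be processed in a loop.

How would you handle the opening price with vectorization?

Can't you use this?

df.groupby(df.index.year)['open'].first()

@r-beginners I tried that. It finished executing, but the resulting column is empty.

Code in a comment cannot be rewritten, so please try the following code: df['date']=pd.to_datetime(df['date']); df['date']=pd.to_datetime(df['date']); df['open']=open_price.values[0]

When I try df['max']=df.groupby(df.index.year)['max'].cummax(), I get: 'RangeIndex' object has no attribute 'year'. I think that is because my date column is not my index. So I tried df['max']=df.groupby(df.date.year)['max'].cummax(), and I get: 'Series' object has no attribute 'year'. For the open I get: 'int' object has no attribute 'replace'. I double-checked, and I typed ind, not int.

Set date as the index. See the updated code. Finally, reset the index if you want to.

Min and max both work fine. Thanks for your help. The problems I have with the "open" code are these. First, it relies on a for loop rather than vectorization. Is there no way to make it work with vectorization? Second, my data starts in August 2011, so it crashes at the start because it is looking for January 1, 2011. I know I could modify my data so that it always starts on January 1, but is there a way to handle this dynamically? In my case, everything from August 2011 to the end of 2011 would be left blank, and from 2012 onward your loop would fill in everything up to the latest row.

I got the open to work with vectorization using df2['open']=df.groupby(df.index.year)['open'].head(1) followed by df2=df2.fillna(method='ffill'). I would add that this approach is easy to change to month. The problem now is that I assumed week and quarter would work too. The code completes without errors, and the output is correct for roughly the first 500 rows; after that it stops being accurate, with no pattern that I can see. Do you know why this method works for month and year but not for week and quarter?

@Psycholommlm Could you edit the question to include the unexpected output together with the output of df.info()? (Posting it in the comments would be hard to read.)
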
# set date as the index
df = df.set_index('date')

# high is the max from Jan1 to current day
df['max'] = df.groupby(df.index.year)['max'].cummax()

# low is the min from Jan1 to current day 
df['min'] = df.groupby(df.index.year)['min'].cummin()

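# open is based on the open of the year
# (writing to `row` inside iterrows() may only change a copy, so write back through df.at;
#  this lookup needs a Jan 1 row for every year in the data)
for ind in df.index:
    df.at[ind, 'open'] = df.loc[ind.replace(month=1, day=1), 'open']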

# OPTIONAL: reset index
df = df.reset_index()
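
The comments ask whether the open can be vectorized and whether data that starts mid-year can be handled. A minimal sketch of one way to do both, assuming a `DatetimeIndex` and the question's column names `open`, `high`, `low`, `close`: group on `df.index.to_period(freq)`, whose labels carry the year, and take the first row of each period with `transform("first")`. The helper name `running_candle` is hypothetical. Unlike the behaviour requested in the comments, a partial first period is not left blank here; it simply opens at its first available row.

```python
import pandas as pd

def running_candle(df, freq):
    """Return the in-progress candle for each daily row.

    freq is a pandas period alias such as "W", "M", "Q" or "Y".
    """
    # Period labels include the year, so the same month, week or quarter
    # of different years falls into different groups.
    key = df.index.to_period(freq)
    grouped = df.groupby(key)

    out = pd.DataFrame(index=df.index)
    out["open"] = grouped["open"].transform("first")  # first open in the period
    out["close"] = df["close"]                        # close of the current day
    out["high"] = grouped["high"].cummax()            # running max within the period
    out["low"] = grouped["low"].cummin()              # running min within the period
    return out

# The sample rows from the question.
df1 = pd.DataFrame(
    {
        "open": [3700, 3800, 3858, 3768, 3789, 3776, 3985],
        "close": [3800, 3857, 3766, 3791, 3772, 3988, 3972],
        "high": [3806, 3880, 3863, 3821, 3839, 4023, 4018],
        "low": [3646, 3750, 3729, 3706, 3756, 3747, 3928],
    },
    index=pd.date_range("2019-01-01", periods=7, freq="D", name="date"),
)

df2 = running_candle(df1, "Y")     # running yearly candle
weekly = running_candle(df1, "W")  # running weekly candle (weeks end on Sunday)
```

On the sample, `df2` reproduces the expected table from the question. With `"W"`, pandas uses weeks that run Monday to Sunday by default, so 2019-01-07 (a Monday) starts a new weekly candle.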