Python: time series DataFrame, groupby into a 3D array (rolling observation/row count) for an LSTM

Tags: python, arrays, pandas, time-series, lstm

I have a time series structured as below, with an identifier column and two value columns (floats).

A DataFrame called simply df:

Date          Id    Value1    Value2
2014-10-01     A      1.1       1.2
2014-10-01     B      1.3       1.4
2014-10-02     A      1.5       1.6
2014-10-02     B      1.7       1.8
2014-10-03     A      3.2       4.8
2014-10-03     B      8.2       10.1
2014-10-04     A      6.1       7.2
2014-10-04     B      4.3       4.1 
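
For reference, the same example frame can be built directly in pandas (a minimal sketch; treating Date as a datetime column is my own assumption):

import pandas as pd

df = pd.DataFrame({
    'Date':   pd.to_datetime(['2014-10-01', '2014-10-01', '2014-10-02', '2014-10-02',
                              '2014-10-03', '2014-10-03', '2014-10-04', '2014-10-04']),
    'Id':     ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    'Value2': [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})
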
What I want to do is convert this frame into an array grouped by the identifier column, with a rolling window of 3 observations, so that I end up with the following:

[[[1.1 1.2]
  [1.5 1.6]   '----> ID A 10/1 to 10/3'
  [3.2 4.8]]

 [[1.3  1.4]
  [1.7  1.8]   '----> ID B 10/1 to 10/3'
  [8.2 10.1]]

 [[1.5 1.6]
  [3.2 4.8]   '----> ID A 10/2 to 10/4'
  [6.1 7.2]] 
  
 [[1.7  1.8]
  [8.2 10.1]  '----> ID B 10/2 to 10/4'
  [4.3  4.1]]]
Ignore the quoted parts inside the array above, of course, but hopefully you get the idea. I have a much larger dataset with more identifiers, and the observation count may need to change, so the row count can't be hard-coded. So far I'm leaning towards taking the unique values of the Id column, creating a temporary df for each, and iterating over it 3 values at a time. It seems like there should be a better and faster way to do this.

"Pseudo code"

This is the part I'm stuck on, though: is there a best way to iterate through temp_df?


The final output will be fed to an LSTM model; however, most other solutions out there are written so that they don't need to handle the groupby aspect, i.e. a column like 'Id'.
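
One more concise way to get this shape is to combine groupby with NumPy's sliding-window helper. A minimal sketch, assuming NumPy >= 1.20 (numpy.lib.stride_tricks.sliding_window_view) and the df built in the sketch above; note that it orders all of one Id's windows before the next Id's, rather than interleaving them by start date as in the illustration above:

import numpy as np

step = 3
feature_columns = ['Value1', 'Value2']

windows = []
for _, group in df.sort_values(['Id', 'Date']).groupby('Id'):
    values = group[feature_columns].to_numpy()                       # (n_dates, n_features) for this Id
    w = np.lib.stride_tricks.sliding_window_view(values, step, axis=0)  # (n_windows, n_features, step)
    windows.append(w.transpose(0, 2, 1))                             # -> (n_windows, step, n_features)

all_array = np.concatenate(windows)                                  # (n_samples, step, n_features)
print(all_array.shape)                                               # (4, 3, 2) for the example frame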

Here is what I ended up doing for the solution. It's not the cleanest, but then my question wasn't winning any beauty contests to begin with:

import numpy as np

id_list = array_steps_df['Id'].unique().tolist()

# change number of steps as needed
step = 3

column_list = ['Value1', 'Value2']

master_list = []

for identifier in id_list:
    master_dict = {}
    for column in column_list:
        # all rows for this identifier, one value column at a time, as an (n, 1) array
        array_steps_id_df = array_steps_df.loc[array_steps_df['Id'] == identifier]
        array_steps_id_df = array_steps_id_df[[column]].values

        master_dict[column] = []

        # rolling windows of `step` consecutive observations
        for obs in range(len(array_steps_id_df) - step + 1):
            start_obs = obs + step
            master_dict[column].append(array_steps_id_df[obs:start_obs, ])
    master_list.append(master_dict)


for idx, dic in enumerate(master_list):
    # init the lists from the first id, then extend them with each subsequent id
    if idx == 0:
        value1_array_init = master_list[0]['Value1']
        value2_array_init = master_list[0]['Value2']
    else:
        value1_array_init += master_list[idx]['Value1']
        value2_array_init += master_list[idx]['Value2']

value1_array = np.array(value1_array_init)
value2_array = np.array(value2_array_init)

# stack the per-column window arrays, reshape to (samples, features, steps),
# then transpose to (samples, steps, features)
all_array = np.hstack((value1_array, value2_array)).reshape((len(value1_array),
                                                             len(column_list),
                                                             step)).transpose(0, 2, 1)
Fixed my error, added the transpose at the end, and redefined the order of features and steps in the reshape.

Thanks to this site for some additional help.
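
For the 8-row example frame above (2 ids, 4 dates, step of 3) the snippet yields 4 samples; a quick sanity check of the shapes, assuming the code above was run with that frame as array_steps_df:

print(value1_array.shape)   # (4, 3, 1) -> 4 windows of 3 steps for Value1
print(value2_array.shape)   # (4, 3, 1)
print(all_array.shape)      # (4, 3, 2) -> (samples, time steps, features), the layout an LSTM expects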


I ended up reworking this a bit to make the columns more dynamic and to keep the time series in order, and I also added a target array to keep the predictions in order. For anyone who needs it, here is the function:

import numpy as np


def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
    """
    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
    :param array_steps_df: the dataframe from the csv
    :param time_steps: how many time steps
    :param columns_to_array: what columns to convert to the array
    :param id_column: what is to be used for the identifier
    :return: data grouped into time_steps observations per identifier and date, plus the matching target array
    """

    id_list = array_steps_df[id_column].unique().tolist()
    date_list = array_steps_df['date'].unique().tolist()

    master_list = []
    target_list = []

    missing_counter = 0
    total_counter = 0

    # grab date size = time steps at a time and iterate through all of them
    for date in range(len(date_list) - time_steps + 1):
        date_range_test = date_list[date:time_steps + date]

        date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                           (array_steps_df['date'] >= date_range_test[0])
                                          ]

        # for each id do it separately so time series data doesn't get mixed up
        for identifier in id_list:

            # get the id in here, then skip it if it doesn't have the required time steps/observations
            date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]

            master_dict = {}

            # if there aren't enough observations for the date range
            if len(date_range_id) != time_steps:

                # don't fully need the counter except in unusual circumstances when debugging;
                # it causes no harm for now
                missing_counter += 1

            else:
                # add the target each loop through, for the last date in the date range for the id or ticker
                target = array_steps_df['target'].\
                         loc[(array_steps_df['date'] == date_range_test[-1])
                           & (array_steps_df[id_column] == identifier)
                            ].iloc[0]

                target_list.append(target)

                total_counter += 1

                # loop through each column in the dataframe
                for column in columns_to_array:

                    date_range_id_value = date_range_id[[column]].values

                    master_dict[column] = []
                    master_dict[column].append(date_range_id_value)

                master_list.append(master_dict)

    # redo columns to arrays, after they have been ordered and grouped by id above
    array_list = []

    # for each column, go through the values in the master list, create an array for the column,
    # then append it to the list
    for column in columns_to_array:

        for idx, dic in enumerate(master_list):
            # init the list here if it's the first value
            if idx == 0:
                value_array_init = master_list[0][column]

            else:
                value_array_init += master_list[idx][column]

        array_list.append(np.array(value_array_init))

    # horizontally stack the per-column arrays, reshape to (samples, features, steps),
    # then transpose to (samples, steps, features)
    all_array = np.hstack(array_list).reshape((total_counter,
                                               len(columns_to_array),
                                               time_steps
                                               )
                                             ).transpose(0, 2, 1)

    target_array_all = np.array(target_list
                                ).reshape(len(target_list),
                                          1)

    # should probably make this an if condition later after a few more tests
    print('check of length of arrays', len(all_array), len(target_array_all))

    return all_array, target_array_all
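
A hypothetical usage sketch: the function expects lowercase 'date' and 'target' columns, so the demo frame below adds both. The placeholder target values and the Keras model are purely illustrative assumptions (the post only says the output feeds an LSTM, not which framework):

import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'date':   pd.to_datetime(['2014-10-01', '2014-10-01', '2014-10-02', '2014-10-02',
                              '2014-10-03', '2014-10-03', '2014-10-04', '2014-10-04']),
    'Id':     ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    'Value2': [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})
demo_df['target'] = np.arange(len(demo_df), dtype=float)   # placeholder target, purely illustrative

X, y = data_to_array_steps(demo_df, time_steps=3,
                           columns_to_array=['Value1', 'Value2'], id_column='Id')
print(X.shape, y.shape)   # (4, 3, 2) (4, 1)

# X already has the (samples, time steps, features) layout an LSTM expects, e.g. with Keras:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(3, 2)),   # (time steps, features)
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, verbose=0)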