Python: Generating a numpy ndarray from a dataframe as input data for Keras


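This is a task I have been mulling over for a while. I have a dataframe containing users' motion features (grouped by user `id`), similar to the one below: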

>>> df
   id  speed1  speed2  acc1  acc2  label
0   1      19      12     5     2      0
1   1      10      11     9     3      0
2   1      12      10     4    -1      0
3   1      29      13     8     4      0
4   1      30      23     9    10      0
5   1      18      11     2    -1      0
6   1      10       6    -3    -2      0
7   2       5       1     0     0      1
8   2       7       2     1     3      1
9   2       6       2     1     0      1
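From this dataframe, I want to generate a `numpy ndarray` (or should I say a list of arrays?) of fixed-length segments by splitting each user's (i.e. each `id`'s) records, so that each segment has shape `(1,5,4)` and can be fed to a neural network, as follows: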

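  • Each segment (hence the `1`) consists of five arrays (hence the `5`) of the motion features `speed1 speed2 acc1 acc2` from the dataframe above (hence the `4`)
  • If the rows cannot fill five arrays, the remaining arrays are filled with zeros (i.e. zero-padded)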
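The `label` column should then also become a separate array whose size matches the new array, obtained by repeating the `label` value at the positions of the zero-padded arrays in padded segments.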

For the `df` example given above, the expected output is:

>>>input_array
[
   [
     [19 12 5 2]
     [10 11 9 3]
     [12 10 4 -1]
     [29 13 8 4]
     [30 23 9 10]
   ]
 
   [
     [18 11 2 -1]
     [10 6 -3 -2]
     [0  0  0  0]
     [0  0  0  0]
     [0  0  0  0]
   ]
 
   [
     [5  1  0 0]
     [7  2  1 3]
     [6  2  1 0]
     [0  0  0 0]
     [0  0  0 0]
   ]
]
  • `id=1` has 7 rows, so the last 3 rows are zero-padded. Similarly, `id=2` has 3 rows, so the last 2 rows are zero-padded
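
The padding counts above follow from simple arithmetic: the number of segments is the row count divided by the segment length, rounded up, and the padding is whatever is missing to fill the last segment. A minimal sketch (the names `rows_per_id` and `summary` are illustrative only):

```python
import math

chunk_size = 5
rows_per_id = {1: 7, 2: 3}  # row counts taken from the example dataframe

summary = {}
for user_id, n_rows in rows_per_id.items():
    n_segments = math.ceil(n_rows / chunk_size)    # segments needed for this id
    n_padded = n_segments * chunk_size - n_rows    # zero rows in the last segment
    summary[user_id] = (n_segments, n_padded)

print(summary)  # {1: (2, 3), 2: (1, 2)}
```

Summing the segment counts gives the expected first dimension of the output, here 2 + 1 = 3.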
Edit

I noticed two bugs in the function given in the answer:

  • In some cases the function introduces an all-zero array. For example:

    df2 = {
        'id': [1,1,1,1,1,1,1,1,1,1,1,1],
        'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
        'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
        'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
        'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
        'label': [3,3,3,3,3,3,3,3,3,3,3,3] }
    
    df2 = pd.DataFrame.from_dict(df2)
    
    X , y = transform(df2[:10])
    X
    array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
             [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
             [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
             [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
             [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],
    
    
           [[[ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
             [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
             [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
             [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
             [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]],
    
    
           [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
             [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
             [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
             [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
             [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]]])
    
    Note how the function introduces an all-zero array as the second element. Ideally, the output should contain only the first and last arrays.

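  • When the dataframe passed in has more than 10 rows, the function fails with an `index can't contain negative values` error. So if you pass the full `df2`, you get: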

    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    <ipython-input-71-743489875901> in <module>()
    ----> 1 X , y = transform(df2)
          2 X
    
    2 frames
    
    <ipython-input-55-f6e028a2e8b8> in transform(dataframe, chunk_size)
         24             inpt = np.pad(
         25                 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
    ---> 26                 mode='constant')
         27             # add each inputs split to accumulators
         28             X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
    
    <__array_function__ internals> in pad(*args, **kwargs)
    
    /usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
        746 
        747     # Broadcast to shape (array.ndim, 2)
    --> 748     pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
        749 
        750     if callable(mode):
    
    /usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
        517 
        518     if as_index and x.min() < 0:
    --> 519         raise ValueError("index can't contain negative values")
        520 
        521     # Converting the array with `tolist` seems to improve performance
    
    ValueError: index can't contain negative values
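
Both symptoms can be examined in isolation. The sketch below does not use the original `transform`; it works on small stand-in arrays (`X_demo` and `inpt` are made up for illustration). The first part flags segments that contain only zeros, the second reproduces the `ValueError` by handing `np.pad` a negative pad width, which is what `chunk_size-len(inpt)` becomes whenever a chunk holds more than `chunk_size` rows.

```python
import numpy as np

# Bug 1 (diagnostic): flag segments that contain only zeros.
# Stand-in for X with shape (n_segments, 1, chunk_size, n_features).
X_demo = np.ones((3, 1, 5, 4))
X_demo[1] = 0.0                      # mimic the spurious middle segment
has_data = np.any(X_demo != 0, axis=(1, 2, 3))
print(np.flatnonzero(~has_data))     # [1]

# Bug 2 (reproduction): np.pad rejects a negative pad width.
chunk_size = 5
inpt = np.zeros((7, 4))              # stand-in chunk with 7 rows
message = None
try:
    np.pad(inpt, [(0, chunk_size - len(inpt)), (0, 0)], mode='constant')
except ValueError as err:
    message = str(err)               # same error as in the traceback above
print(message)
```

Note that the all-zero check would also flag a genuine segment whose features happen to be zero throughout, so it is a debugging aid rather than a filter to apply blindly.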
    
    
    [Edit] The bugs are fixed. The implementation below should now produce the desired output:

    import pandas as pd
    import numpy as np
    
    df = {
        'id': [1,1,1,1,1,1,1,1,1,1,1,1],
        'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
        'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
        'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
        'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
        'label': [3,3,3,3,3,3,3,3,3,3,3,3] }
    
    df = pd.DataFrame.from_dict(df)
    
    def transform(dataframe, chunk_size=5):
        
        grouped = dataframe.groupby('id')
    
        # initialize accumulators
        X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    
        # loop over each group (df[df.id==1] and df[df.id==2])
        for _, group in grouped:
    
            inputs = group.loc[:, 'speed1':'acc2'].values
            label = group.loc[:, 'label'].values[0]
    
            # calculate number of splits
            N = (len(inputs)-1) // chunk_size
    
            if N > 0:
                inputs = np.array_split(
                     inputs, [chunk_size + (chunk_size*i) for i in range(N)])
            else:
                inputs = [inputs]
    
            # loop over splits
            for inpt in inputs:
                inpt = np.pad(
                    inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                    mode='constant')
                # add each inputs split to accumulators
                X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
                y = np.concatenate([y, label[np.newaxis]], axis=0) 
    
        return X, y
    
    X, y = transform(df)
    
    print('X shape =', X.shape)
    print('X =', X)
    print('Y shape =', y.shape)
    print('Y =', y)
    
    # >> out:
    # X shape = (3, 1, 5, 4)
    # X = [[[[17.63  0.    0.   29.03]
    #    [17.63 -0.09  0.01 56.12]
    #    [ 0.17  1.24 -2.04 18.49]
    #    [ 1.41 -0.8   0.51 11.85]
    #    [ 0.61 -0.29  0.15 36.75]]]
    #
    #
    #  [[[ 0.32 -0.14  0.39 27.52]
    #    [ 0.18  0.25 -0.38 81.08]
    #    [ 0.43 -0.13  0.29 51.06]
    #    [ 0.3   0.16  0.13 19.85]
    #    [ 0.46  0.29 -0.67 10.76]]]
    #
    #
    #  [[[ 0.75 -0.38  0.65 14.51]
    #    [ 0.37  0.27  0.52 24.27]
    #    [ 0.    0.    0.    0.  ]
    #    [ 0.    0.    0.    0.  ]
    #    [ 0.    0.    0.    0.  ]]]]
    # Y shape = (3,)
    # Y = [3. 3. 3.]
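
As a possible alternative, the same result can be sketched without `np.array_split` and without growing the arrays via repeated `np.concatenate`: pad each user's rows once up to the next multiple of `chunk_size`, then reshape. The function name `segment_dataframe` and the constant `FEATURES` are hypothetical, and the sketch assumes every `id` carries a single label and at least one row. It is run here on the 10-row dataframe from the question.

```python
import numpy as np
import pandas as pd

FEATURES = ['speed1', 'speed2', 'acc1', 'acc2']  # assumed feature columns

def segment_dataframe(dataframe, chunk_size=5):
    """Hypothetical variant of transform() that pads once per user."""
    segments, labels = [], []
    for _, group in dataframe.groupby('id'):
        values = group[FEATURES].to_numpy()
        # zero rows needed to reach the next multiple of chunk_size
        n_pad = (-len(values)) % chunk_size
        padded = np.pad(values, [(0, n_pad), (0, 0)], mode='constant')
        chunks = padded.reshape(-1, 1, chunk_size, len(FEATURES))
        segments.append(chunks)
        # assumes one label per id; repeat it once per segment
        labels.extend([group['label'].iloc[0]] * len(chunks))
    return np.concatenate(segments, axis=0), np.array(labels)

example = pd.DataFrame({
    'id':     [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'speed1': [19, 10, 12, 29, 30, 18, 10, 5, 7, 6],
    'speed2': [12, 11, 10, 13, 23, 11, 6, 1, 2, 2],
    'acc1':   [5, 9, 4, 8, 9, 2, -3, 0, 1, 1],
    'acc2':   [2, 3, -1, 4, 10, -1, -2, 0, 3, 0],
    'label':  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
X, y = segment_dataframe(example)
print(X.shape)  # (3, 1, 5, 4)
print(y)        # [0 0 1]
```

Because the pad width is computed with a modulo, it can never be negative, which sidesteps the `np.pad` error discussed earlier.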
    

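    I have updated the code above (the output is now (3,1,5,4)), thanks.

    I am preparing the data as input to a conv net; the goal is to transform the input into a grid-like shape `(1,100,4)`: 1 segment (length) of 100 data points (width) with 4 motion features (channels).

    I will do my best to help. Note: I have updated the answer/implementation twice. After the first update it still had some issues/bugs, but after the second update it should now work.

    There was a problem with how I used `np.array_split` for the splitting (the array was split at the wrong positions), which caused `chunk_size-len(inpt)` to be negative (this is the `pad_width` argument, the second parameter passed to `np.pad`) and raised the `ValueError`.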
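
To make the splitting issue concrete: when `np.array_split` receives a list, its entries are the row indices at which to cut, not chunk lengths, while an integer means "number of sections". The sketch below uses a toy 12-element array (`rows` is a stand-in, not the real data) and shows one way chunks longer than `chunk_size` can arise; it is not a reconstruction of the earlier, buggy version of the answer.

```python
import numpy as np

rows = np.arange(12)   # stand-in for 12 rows of one user
chunk_size = 5

# cut points at multiples of chunk_size, as in the corrected transform()
N = (len(rows) - 1) // chunk_size
cut_points = [chunk_size + chunk_size * i for i in range(N)]
chunks = np.array_split(rows, cut_points)
print(cut_points)                 # [5, 10]
print([len(c) for c in chunks])   # [5, 5, 2]

# an integer argument asks for that many sections of near-equal size
sections = np.array_split(rows, 2)
print([len(c) for c in sections]) # [6, 6]
```

With chunks of 6 rows, `chunk_size - len(chunk)` is -1, and passing that to `np.pad` as a pad width raises the `ValueError`.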