Python: Converting a sequence into an array of sequences (Python, NumPy, PySpark)


I have a PySpark DataFrame with 30 observations for each unique ID, as shown below:

id  time  features
 1     0   [1,2,3]
 1     1   [4,5,6]
..    ..        ..
 1    29   [7,8,9]
 2     0   [0,1,2]
 2     1   [3,4,5]
..    ..        ..
 2    29   [6,7,8]
..    ..        ..

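What I need to do is create an array of sequences to feed into a Keras neural network. For example, suppose one ID has the following smaller data set: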

id  time  features
 1     0    [1,2,3]
 1     1    [4,5,6]
 1     2    [7,8,9]

The required data format is:

[[[1,2,3],
  [0,0,0],
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]

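I can use the pad_sequences function from the keras package to add the [0,0,0] rows, so what I really need is to be able to create the following array for every ID: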

[[[1,2,3]],
 [[1,2,3],
  [4,5,6]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]
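
For reference, the padding step can be sketched with NumPy alone. The helper pad_post below is a hypothetical stand-in written for this illustration. It is meant to mimic what keras's pad_sequences produces when called with padding='post' (the keras default is padding='pre', which would put the zero rows first), and it is not the library's implementation.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Pad variable-length sequences of feature rows at the end (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    n_features = len(sequences[0][0])
    # Start from an array filled with the padding value, then copy each sequence in.
    out = np.full((len(sequences), max_len, n_features), value)
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = seq
    return out

# The unpadded prefixes from the small example above.
prefixes = [[[1, 2, 3]],
            [[1, 2, 3], [4, 5, 6]],
            [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
padded = pad_post(prefixes)
print(padded)
```

With the three prefixes above, padded has shape (3, 3, 3) and matches the required data format shown earlier.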

The only way I can think of is to use a loop, for example:

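x = []
for i in range(10000):        # one batch of 10000 unique IDs
    user = x_train[i]         # the 30 observations for this ID
    arr = []
    for j in range(30):
        # j + 1 so that the first prefix holds one observation, not zero
        arr.append(user[0:j + 1])
    x.append(arr)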

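However, the loop solution is not feasible. I have 904 batches of 10,000 unique IDs, each with 30 observations. I will collect one batch at a time into a numpy array, so a numpy solution would be fine. A pyspark solution using RDDs would be great. Something using map, perhaps?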
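
One way the map idea could look with RDDs is sketched below. The function build_prefixes is a hypothetical helper, and the Spark lines are left as comments because they assume a DataFrame named df with columns id, time and features that can be turned into plain lists. This is an illustration under those assumptions, not a tested pipeline.

```python
def build_prefixes(rows):
    """Given (time, features) pairs for one ID, return every prefix in time order."""
    # Sort by time first, because grouping does not guarantee row order.
    ordered = [features for _, features in sorted(rows, key=lambda r: r[0])]
    return [ordered[:k] for k in range(1, len(ordered) + 1)]

# Local check with the small example (rows deliberately out of order).
example = [(1, [4, 5, 6]), (0, [1, 2, 3]), (2, [7, 8, 9])]
result = build_prefixes(example)
print(result)

# Hypothetical usage with a DataFrame `df` (requires pyspark; not executed here):
#   pairs = df.rdd.map(lambda r: (r["id"], (r["time"], list(r["features"]))))
#   per_id = pairs.groupByKey().mapValues(build_prefixes)
```

groupByKey shuffles all rows for an ID to one place, which should be acceptable here since each ID has only 30 observations.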

Here is a numpy solution that creates the desired output, including the zeros. It uses triu_indices to build the "cumulative time series structure".

Output:
small example [[[[76 53 48]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [61 90 69]]]


 [[[68 32 20]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [28 73 78]]]]
time needed for big example 0.2251 secs
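
The following is a minimal sketch of how a triu_indices-based solution might look. It is not the answerer's original code: the function name cumulative_sequences and the assumed input shape (n_ids, n_steps, n_features) are choices made for this illustration, and the output printed above was not produced by this sketch.

```python
import numpy as np

def cumulative_sequences(x):
    """Turn (n_ids, n_steps, n_features) into zero-padded prefixes
    of shape (n_ids, n_steps, n_steps, n_features)."""
    n_ids, n_steps, n_features = x.shape
    out = np.zeros((n_ids, n_steps, n_steps, n_features), dtype=x.dtype)
    # triu_indices yields all index pairs with step <= prefix.
    step, prefix = np.triu_indices(n_steps)
    # Prefix number `prefix` contains time steps 0..prefix; the rest stays zero.
    out[:, prefix, step] = x[:, step]
    return out

# Small check with the data from the question (one ID, three time steps).
x_small = np.arange(1, 10).reshape(1, 3, 3)
out_small = cumulative_sequences(x_small)
print(out_small[0])
```

Because the assignment is a single vectorized indexing operation, no Python-level loop over IDs or time steps is needed.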

Why not do something along these lines:
dict1 = {}

for tuple1 in your_collection:
    if tuple1['id'] not in dict1:
        # First time we see this id: start a list of lists of feature lists.
        dict1[tuple1['id']] = [[tuple1['features']]]
    else:
        # We have seen this id before: copy the previous (n-1) list of
        # feature lists from the current dictionary entry, append the
        # current features to that copy, then append the extended list
        # back to the entry (essentially a 3D matrix), so each entry is
        # a 3D list keyed by id.
        prev_list = dict1[tuple1['id']][-1][:]
        prev_list.append(tuple1['features'])
        dict1[tuple1['id']].append(prev_list)

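This approach is not space-efficient (it stores every prefix, so memory grows quadratically with the number of observations per ID), but it may work if you are dealing with a data set of limited size.

This is perfect, thanks! I added a final reshape to the output: x = x.reshape(x.shape[0]*x.shape[1], x.shape[1], x.shape[3]), but this is exactly what I needed.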
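
The reshape mentioned in the comment above can be illustrated on a dummy array. The sizes here are assumptions chosen for the illustration. The 4D output has two axes of length n_steps (prefix index and time step), which is why x.shape[1] and x.shape[2] are interchangeable in the reshape call.

```python
import numpy as np

# Assumed sizes for illustration: 2 IDs, 4 time steps, 3 features.
n_ids, n_steps, n_features = 2, 4, 3
x = np.zeros((n_ids, n_steps, n_steps, n_features))

# Merge the ID axis and the prefix axis so every prefix becomes one sample.
flat = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])
print(flat.shape)  # (8, 4, 3)
```

Each row of the flattened array is then one zero-padded sequence of shape (n_steps, n_features).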