Python: converting sequences into an array of sequences


I have a pyspark DataFrame with 30 observations for each unique ID, like this:

id  time  features
 1     0   [1,2,3]
 1     1   [4,5,6]
..    ..        ..
 1    29   [7,8,9]
 2     0   [0,1,2]
 2     1   [3,4,5]
..    ..        ..
 2    29   [6,7,8]
..    ..        ..
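What I need to do is build an array of sequences to feed into a keras neural network. For example, suppose one id has the following smaller dataset: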

id  time  features
 1     0    [1,2,3]
 1     1    [4,5,6]
 1     2    [7,8,9]
The required data format is:

[[[1,2,3],
  [0,0,0]
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]
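I can add the [0,0,0] rows with the pad_sequences function from the keras package, so what I really need is to be able to create the following array for all IDs: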

[[[1,2,3]],
 [[1,2,3],
  [4,5,6]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]
The only way I can think of is a loop, something like:

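x = []
for i in range(10000):               # one batch of 10000 unique IDs
    user = x_train[i]                # the 30 feature rows for this ID
    arr = []
    for j in range(30):
        arr.append(user[0:j + 1])    # prefix up to and including time step j
    x.append(arr)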

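However, the loop solution is not feasible. I have 904 batches of 10000 unique IDs, each with 30 observations. I will collect one batch at a time into a numpy array, so a numpy solution is fine. A pyspark solution using RDDs would be great. Something with map, perhaps?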

Here is a numpy solution that creates the desired output, including the zeros. It uses triu_indices to create the "cumulative time series structure":
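
The answer's original code is not preserved on this page. The following is only a minimal sketch of how triu_indices could be used for this, assuming the batch has already been collected into a numpy array of shape (n_ids, T, F); the function name cumulative_sequences and the variable names are hypothetical, not taken from the answer.

```python
import numpy as np

def cumulative_sequences(x):
    """Map x of shape (n_ids, T, F) to shape (n_ids, T, T, F).

    out[i, t, s] equals x[i, s] when s <= t and stays zero otherwise,
    so out[i, t] is the zero-padded prefix of ID i up to time step t.
    """
    n_ids, T, F = x.shape
    out = np.zeros((n_ids, T, T, F), dtype=x.dtype)
    # triu_indices(T) returns index pairs (r, c) with r <= c. Here r is
    # used as the time step s and c as the prefix index t.
    s, t = np.triu_indices(T)
    out[:, t, s] = x[:, s]
    return out

# The small example from the question: one ID, three time steps.
small = np.arange(1, 10).reshape(1, 3, 3)
result = cumulative_sequences(small)
print(result[0])
```

For the small example this gives the three padded prefixes shown in the question. The output listed below comes from the original answer, which used different (random) input values.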

Output:

small example [[[[76 53 48]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [61 90 69]]]


 [[[68 32 20]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [28 73 78]]]]
time needed for big example 0.2251 secs

Why not do something along these lines:

dict1 = {}

for tuple1 in your_collection:
    if tuple1['id'] not in dict1:
        # First time we see this id: start its entry with a list that
        # holds one sequence containing the current feature list.
        dict1[tuple1['id']] = [[tuple1['features']]]
    else:
        # Seen before: copy the latest (n-1)-step sequence, append the
        # current feature list to the copy, and add the longer sequence
        # to the entry. Each entry is a 3D list keyed by id.
        prev_list = dict1[tuple1['id']][-1][:]
        prev_list.append(tuple1['features'])
        dict1[tuple1['id']].append(prev_list)

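This approach is expensive in space, since every ID stores all of its prefixes, but it may work if you are dealing with a dataset of limited size.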
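
To make the space cost concrete, here is a small sketch of the same idea run on the three-row example from the question. It assumes the rows are dicts with 'id' and 'features' keys and arrive ordered by time within each id; the helper name prefixes_by_id is hypothetical.

```python
from collections import defaultdict

def prefixes_by_id(rows):
    """Collect, for every id, the growing list of feature-list prefixes."""
    result = defaultdict(list)
    for row in rows:
        seqs = result[row["id"]]
        prev = seqs[-1] if seqs else []
        # prev + [...] builds a new list, so earlier prefixes are untouched.
        seqs.append(prev + [row["features"]])
    return dict(result)

rows = [
    {"id": 1, "features": [1, 2, 3]},
    {"id": 1, "features": [4, 5, 6]},
    {"id": 1, "features": [7, 8, 9]},
]
prefixes = prefixes_by_id(rows)
print(prefixes[1])

# Number of feature-list references held for one id: 1 + 2 + 3 = 6,
# that is T * (T + 1) / 2 for T time steps (465 when T = 30).
stored = sum(len(p) for p in prefixes[1])
print(stored)
```

The result is the ragged array the question asks for. It would still need padding afterwards, for example with pad_sequences; note that its default is padding='pre', so padding='post' would be needed to match the layout in the question.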

This is perfect, thanks! I added a final reshape to the output: x = x.reshape(x.shape[0]*x.shape[1], x.shape[1], x.shape[3]), but this is exactly what I needed.
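
As an illustration of what that reshape does, here is a sketch using a stand-in array, assuming the 4D output has shape (n_ids, T, T, F). The sizes below are made up for the example.

```python
import numpy as np

n_ids, T, F = 2, 4, 3
# Hypothetical stand-in for the 4D output, shape (n_ids, T, T, F).
x = np.arange(n_ids * T * T * F).reshape(n_ids, T, T, F)

# Merge the ID axis and the prefix axis into a single sample axis.
# x.shape[1] and x.shape[2] are both T, so either can be used for the
# middle dimension.
flat = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])

print(flat.shape)  # (8, 4, 3)
# Sample i * T + t is the sequence of ID i truncated at time step t.
same = np.array_equal(flat[1 * T + 2], x[1, 2])
print(same)
```

Each of the n_ids * T rows of flat is then one (T, F) training sequence, which is the (samples, timesteps, features) layout that Keras recurrent layers expect.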