Python: converting sequences into an array of sequences
I have a PySpark DataFrame with 30 observations for each unique ID, as shown below:
id time features
1 0 [1,2,3]
1 1 [4,5,6]
.. .. ..
1 29 [7,8,9]
2 0 [0,1,2]
2 1 [3,4,5]
.. .. ..
2 29 [6,7,8]
.. .. ..
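What I need to do is create an array of sequences to feed into a keras neural network. For example, suppose one id has the following smaller dataset: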
id time features
1 0 [1,2,3]
1 1 [4,5,6]
1 2 [7,8,9]
The required data format is:
[[[1,2,3],
[0,0,0],
[0,0,0]],
[[1,2,3],
[4,5,6],
[0,0,0]],
[[1,2,3],
[4,5,6],
[7,8,9]]]
I can use the pad_sequences function from the keras package to add the [0,0,0] rows, so what I really need is a way to create the following array for all IDs:
[[[1,2,3]],
[[1,2,3],
[4,5,6]],
[[1,2,3],
[4,5,6],
[7,8,9]]]
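To make the target concrete, here is a minimal sketch for a single ID using plain NumPy. The helper names `prefixes` and `pad_post` are hypothetical and introduced only for illustration; `pad_post` is a hand-written stand-in for end-padding with zeros, not the keras function itself.

```python
import numpy as np

def prefixes(seq):
    """Return [seq[:1], seq[:2], ..., seq[:len(seq)]] as a list of arrays."""
    seq = np.asarray(seq)
    return [seq[: t + 1] for t in range(len(seq))]

def pad_post(seqs, maxlen):
    """Zero-pad each 2-D array in seqs at the end so it has maxlen rows."""
    n_features = seqs[0].shape[1]
    out = np.zeros((len(seqs), maxlen, n_features), dtype=seqs[0].dtype)
    for i, s in enumerate(seqs):
        out[i, : len(s)] = s
    return out

one_id = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ragged = prefixes(one_id)            # the unpadded, growing sequences
padded = pad_post(ragged, maxlen=3)  # shape (3, 3, 3), zeros at the end
```

This still loops in Python, so it only illustrates the shapes involved rather than solving the performance problem.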
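The only way I can think of is to use loops, something like:
x = []
for i in range(10000):
    user = x_train[i]
    arr = []
    for j in range(30):
        # j + 1 so the first slice holds one observation and the last holds all 30
        arr.append(user[0:j + 1])
    x.append(arr)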
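However, a loop-based solution is not feasible. I have 904 batches of 10,000 unique IDs, each with 30 observations. I will collect one batch at a time into a numpy array, so a numpy solution is fine. A PySpark solution using RDDs would be great. Something using map, perhaps?

Here is a numpy solution that creates the desired output, including the zeros.
It uses triu_indices to create the "cumulative time series" structure: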
Output:
small example [[[[76 53 48]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[ 0 0 0]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[62 39 17]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[62 39 17]
[61 90 69]]]
[[[68 32 20]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[ 0 0 0]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[30 3 9]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[30 3 9]
[28 73 78]]]]
time needed for big example 0.2251 secs
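The answer's code did not survive on this page. The following is a minimal sketch of the triu_indices idea, not the original author's code. It assumes the batch is already a NumPy array of shape (n_ids, T, F); the function name `cumulative_sequences` is hypothetical.

```python
import numpy as np

def cumulative_sequences(x):
    """x has shape (n_ids, T, F); the result has shape (n_ids, T, T, F).

    out[i, s] holds the first s + 1 observations of ID i,
    zero-padded at the end to length T.
    """
    n_ids, T, F = x.shape
    out = np.zeros((n_ids, T, T, F), dtype=x.dtype)
    # triu_indices yields all pairs (r, c) with r <= c.
    # Here r is the time step and c is the index of the growing sequence,
    # so sequence c receives time steps 0..c.
    r, c = np.triu_indices(T)
    out[:, c, r, :] = x[:, r, :]
    return out

# The small example from the question: one ID, three time steps, three features.
x_small = np.arange(1, 10).reshape(1, 3, 3)
result = cumulative_sequences(x_small)
```

The assignment is a single fancy-indexing operation, so no Python-level loop over IDs or time steps is needed. Memory grows as n_ids * T * T * F, which is worth checking before running it on a full batch.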
Why not do something along these lines:
dict1 = {}
for tuple1 in your_collection:
    if tuple1['id'] not in dict1:
        # First time we see this id: start a list of lists of feature lists.
        dict1[tuple1['id']] = [[tuple1['features']]]
    else:
        # We have seen this id before: copy the previous (n-1) list of
        # feature lists from the current dictionary entry, append the
        # current features to that copy, then append the updated list
        # back to the entry (which is essentially a 3d matrix).
        # So each entry is a 3d list keyed by id.
        prev_list = dict1[tuple1['id']][-1][:]
        prev_list.append(tuple1['features'])
        dict1[tuple1['id']].append(prev_list)
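This approach is not space-efficient, but it may work if you are dealing with a limited amount of data.

This is perfect, thank you! I added a final reshape to the output: x = x.reshape(x.shape[0]*x.shape[1], x.shape[1], x.shape[3]), but this is exactly what I needed.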
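As a rough illustration of that final reshape, assume the output is a 4-D array of shape (n_ids, T, T, F); the sizes below are made up. The sketch writes the second target dimension as x.shape[2], which equals x.shape[1] here because both are T.

```python
import numpy as np

n_ids, T, F = 2, 4, 3  # illustrative sizes only
x = np.arange(n_ids * T * T * F).reshape(n_ids, T, T, F)

# Merge the ID axis and the sequence axis: every padded sequence
# becomes one sample of shape (T, F).
flat = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])
```

After the reshape, sample number i * T + s in `flat` is the sequence x[i, s], so the samples stay grouped by ID.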