TensorFlow: writing and reading a SparseTensor in a tfrecord file


Is it possible to do this elegantly?

Right now the only way I can think of is to save the indices (tf.int64), values (tf.float32), and shape (tf.int64) of the SparseTensor in 3 separate features (the first two being VarLenFeature and the last being FixedLenFeature). This seems really cumbersome.
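For concreteness, a minimal sketch of that 3-feature approach (TF 1.x API; the helper name sparse_to_example and the rank-2 assumption are mine):

import numpy as np
import tensorflow as tf

def sparse_to_example(indices, values, dense_shape):
    # one Feature each for the indices, values, and shape of the SparseTensor
    return tf.train.Example(features=tf.train.Features(feature={
        'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=np.asarray(indices).ravel())),
        'values': tf.train.Feature(float_list=tf.train.FloatList(value=values)),
        'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=dense_shape)),
    }))

# parsing spec: the first two are VarLenFeature, the last is FixedLenFeature
feature_spec = {
    'indices': tf.VarLenFeature(tf.int64),
    'values': tf.VarLenFeature(tf.float32),
    'shape': tf.FixedLenFeature([2], dtype=tf.int64),  # assuming a rank-2 tensor
}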

Any suggestion would be appreciated.

Update 1: My answer below is not suitable for building a computation graph (because the contents of the sparse tensor have to be extracted via sess.run(), which costs a lot of time if called repeatedly).


Inspired by this, I thought maybe we could get the bytes generated by tf.serialize_sparse so that later we can recover the SparseTensor using tf.deserialize_many_sparse. But tf.serialize_sparse is not implemented in pure Python (it calls the external function SerializeSparse), which means we would still need to use sess.run() to get the bytes. How can I get a pure Python version of it? Thanks.
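For context, this is the round trip I have in mind (a sketch against the TF 1.x API; the sess.run() below is exactly the call I would like to avoid):

import tensorflow as tf

st = tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1.0, 2.0], dense_shape=[3, 4])
serialized = tf.serialize_sparse(st)  # 1-D string tensor with 3 elements (indices, values, shape)

with tf.Session() as sess:
    sst = sess.run(serialized)  # <- the sess.run() call I want to avoid

# the bytes could later be restored with tf.deserialize_many_sparse,
# which expects a batched [N, 3] tensor of serialized sparse tensors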

Since TensorFlow currently only supports 3 types in tfrecord (Float, Int64 and Bytes), and a SparseTensor usually involves more than one type, my solution is to convert the SparseTensor to Bytes with Pickle.

Here is the sample code:

import tensorflow as tf
import pickle
import numpy as np
from scipy.sparse import csr_matrix

#---------------------------------#
# Write to a tfrecord file

# create two sparse matrices (simulate the values from .eval() of SparseTensor)
a = csr_matrix(np.arange(12).reshape((4,3)))
b = csr_matrix(np.random.rand(20).reshape((5,4)))

# convert them to pickle bytes
p_a = pickle.dumps(a)
p_b = pickle.dumps(b)

# put the bytes in context_list and feature_list
## save p_a in context_lists 
context_lists = tf.train.Features(feature={
    'context_a': tf.train.Feature(bytes_list=tf.train.BytesList(value=[p_a]))
    })
## save p_b as a one element sequence in feature_lists
p_b_features = [tf.train.Feature(bytes_list=tf.train.BytesList(value=[p_b]))]
feature_lists = tf.train.FeatureLists(feature_list={
    'features_b': tf.train.FeatureList(feature=p_b_features)
    })

# create the SequenceExample
SeqEx = tf.train.SequenceExample(
    context = context_lists,
    feature_lists = feature_lists
    )
SeqEx_serialized = SeqEx.SerializeToString()

# write to a tfrecord file
tf_FWN = 'test_pickle1.tfrecord'
tf_writer1 = tf.python_io.TFRecordWriter(tf_FWN)
tf_writer1.write(SeqEx_serialized)
tf_writer1.close()

#---------------------------------#
# Read from the tfrecord file

# first, define the parse function
def _parse_SE_test_pickle1(in_example_proto):
    context_features = {
        'context_a': tf.FixedLenFeature([], dtype=tf.string)
        }
    sequence_features = {
        'features_b': tf.FixedLenSequenceFeature([1], dtype=tf.string)
        }
    context, sequence = tf.parse_single_sequence_example(
      in_example_proto, 
      context_features=context_features,
      sequence_features=sequence_features
      )
    p_a_tf = context['context_a']
    p_b_tf = sequence['features_b']

    return tf.tuple([p_a_tf, p_b_tf])

# use the Dataset API to read
dataset = tf.data.TFRecordDataset(tf_FWN)
dataset = dataset.map(_parse_SE_test_pickle1)
dataset = dataset.batch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iterator.initializer)

[p_a_bat, p_b_bat] = sess.run(next_element)

# 1st index refers to the batch; 2nd and 3rd indices refer to the sequence position (only for b)
rec_a = pickle.loads(p_a_bat[0])
rec_b = pickle.loads(p_b_bat[0][0][0])

# check whether the recovered matrices are the same as the original ones
assert((rec_a - a).nnz == 0)
assert((rec_b - b).nnz == 0)

# print the contents
print("\n------ a -------")
print(a.todense())
print("\n------ rec_a -------")
print(rec_a.todense())
print("\n------ b -------")
print(b.todense())
print("\n------ rec_b -------")
print(rec_b.todense())
Here is what I got:

------ a -------
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

------ rec_a -------
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

------ b -------
[[ 0.88612402  0.51438017  0.20077887  0.20969243]
 [ 0.41762425  0.47394715  0.35596051  0.96074408]
 [ 0.35491739  0.0761953   0.86217511  0.45796474]
 [ 0.81253723  0.57032448  0.94959189  0.10139615]
 [ 0.92177499  0.83519464  0.96679833  0.41397829]]

------ rec_b -------
[[ 0.88612402  0.51438017  0.20077887  0.20969243]
 [ 0.41762425  0.47394715  0.35596051  0.96074408]
 [ 0.35491739  0.0761953   0.86217511  0.45796474]
 [ 0.81253723  0.57032448  0.94959189  0.10139615]
 [ 0.92177499  0.83519464  0.96679833  0.41397829]]
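If the recovered csr_matrix then needs to go back into TensorFlow as a tf.SparseTensor, one possible sketch (via scipy's COO format; note that tf.SparseTensor expects int64 indices):

coo = rec_a.tocoo()
# stack the row/column coordinates into an [nnz, 2] index array
indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
st_a = tf.SparseTensor(indices=indices, values=coo.data, dense_shape=coo.shape)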

I ran into the problem of writing and reading sparse tensors to and from TFRecord files, and I found very little related information online.

As you suggested, one solution is to store the indices, values and shape of the SparseTensor in 3 separate features, which is discussed here. This seems neither efficient nor elegant.

I have a working example (with tensorflow 2.0.0.alpha0). Maybe not the most elegant, but it seems to work.

import tensorflow as tf
import numpy as np

# example data
st_1 = tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
st_2 = tf.SparseTensor(indices=[[0, 1], [2, 0], [3, 3]], values=[3, 9, 5], dense_shape=[4, 4])
sparse_tensors = [st_1, st_2]

# serialize the sparse tensors into an array of byte strings
serialized_sparse_tensors = [tf.io.serialize_sparse(st).numpy() for st in sparse_tensors]

# write to TFRecord
with tf.io.TFRecordWriter('sparse_example.tfrecord') as tfwriter:
    for sst in serialized_sparse_tensors:
        sparse_example = tf.train.Example(features=
            tf.train.Features(feature=
                {'sparse_tensor':
                    tf.train.Feature(bytes_list=tf.train.BytesList(value=sst))
                }))
        # append each example to the tfrecord
        tfwriter.write(sparse_example.SerializeToString())

def parse_fn(data_element):
    features = {'sparse_tensor': tf.io.FixedLenFeature([3], tf.string)}
    parsed = tf.io.parse_single_example(data_element, features=features)

    # deserialize_many_sparse() requires the dimensions to be [N, 3], so we add one dimension with expand_dims
    parsed['sparse_tensor'] = tf.expand_dims(parsed['sparse_tensor'], axis=0)
    # deserialize the sparse tensor
    parsed['sparse_tensor'] = tf.io.deserialize_many_sparse(parsed['sparse_tensor'], dtype=tf.int32)
    # convert from sparse to dense
    parsed['sparse_tensor'] = tf.sparse.to_dense(parsed['sparse_tensor'])
    # remove the extra dimension: [1, d0, d1] -> [d0, d1]
    parsed['sparse_tensor'] = tf.squeeze(parsed['sparse_tensor'])
    return parsed

# read from the TFRecord
dataset = tf.data.TFRecordDataset(['sparse_example.tfrecord'])
dataset = dataset.map(parse_fn)
# pad and batch the dataset
dataset = dataset.padded_batch(2, padded_shapes={'sparse_tensor': [None, None]})

next(iter(dataset))
This produces:

{'sparse_tensor': <tf.Tensor: shape=(2, 4, 4), dtype=int32, numpy=...>}

Would it be possible to add axis=0 to the tf.squeeze() call, to make sure that only the expanded dimension is removed?
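For illustration, the change being suggested would look like this (with an explicit axis, tf.squeeze only drops the named dimension and raises an error if its size is not 1):

# only remove the dimension added by expand_dims: [1, d0, d1] -> [d0, d1]
parsed['sparse_tensor'] = tf.squeeze(parsed['sparse_tensor'], axis=0)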