How do I select a row from a SparseTensor in TensorFlow?


python tensorflow


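Say I have two SparseTensors, the first of which looks like this:

[[1, 0, 0, 0],
 [2, 0, 0, 0],
 [1, 2, 0, 0]]

together with a second one holding the matching weights, and I want to extract the first two rows from each. I need the indices and values of the non-zero entries as SparseTensors so that I can pass the result to tf.nn.embedding_lookup_sparse. How can I do this?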

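My application is this: I want to use word embeddings, which is very straightforward in TensorFlow. But now I want to use sparse embeddings, i.e. common words have their own embeddings, while the embedding of a rare word is a sparse linear combination of the embeddings of common words. So I need two cookbooks that describe how the sparse embeddings are composed. In the example above, the cookbook says: the embedding of the first word consists of its own embedding with weight 1.0. The second word is similar. For the last word, it says: the embedding of this word is a linear combination of the embeddings of the first two words, with weights 0.3 and 0.7 respectively. I need to extract a row and then feed the indices and weights to tf.nn.embedding_lookup_sparse to obtain the final embedding. How can I do that in TensorFlow?

Or do I need to work around it, e.g. preprocess my data and handle the cookbook outside of TensorFlow?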
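To make the goal concrete, here is a minimal NumPy sketch of the computation described above: each row's embedding is the weighted sum of the embeddings it references. The `embeddings` table and the helper name `combine_rows` are made up for illustration; the ids and weights are the ones from the example, stored as (row, column) coordinates. It only approximates what a weighted-sum sparse lookup would produce and is not TensorFlow code.

```python
import numpy as np

# Hypothetical embedding table: one 2-d vector per id (id 0 is unused here).
embeddings = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])

# COO form of the two "cookbooks": (row, col) positions plus ids and weights.
indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
ids = np.array([1, 2, 1, 2])
weights = np.array([1.0, 1.0, 0.3, 0.7])

def combine_rows(indices, ids, weights, embeddings, num_rows):
    """Weighted sum of the referenced embeddings, accumulated per row."""
    out = np.zeros((num_rows, embeddings.shape[1]))
    # np.add.at accumulates correctly even when a row index repeats.
    np.add.at(out, indices[:, 0], weights[:, None] * embeddings[ids])
    return out

result = combine_rows(indices, ids, weights, embeddings, num_rows=3)
print(result)
```

With this toy table, the last row comes out as 0.3 times the first word's vector plus 0.7 times the second word's vector.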

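I got in touch with one of the engineers here who knows more about this area, and this is what he passed on:

I'm not sure we have an efficient implementation of this, but here is a less-than-optimal one that uses the dynamic_partition and gather ops.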

import tensorflow as tf  # written against the TF 1.x graph/session API

def sparse_slice(indices, values, needed_row_ids):
  # indices: [nnz, 2] COO coordinates; values: [nnz] matching entries.
  num_rows = tf.shape(indices)[0]  # number of non-zero entries
  # 1 where an entry's row coordinate equals the requested row, else 0.
  partitions = tf.cast(tf.equal(indices[:,0], needed_row_ids), tf.int32)
  # Partition 1 holds the positions of the entries to keep.
  rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
  slice_indices = tf.gather(indices, rows_to_gather)
  slice_values = tf.gather(values, rows_to_gather)
  return slice_indices, slice_values

with tf.Session().as_default():
  indices = tf.constant([[0,0], [1, 0], [2, 0], [2, 1]])
  values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
  needed_row_ids = tf.constant([1])
  slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
  print(slice_indices.eval(), slice_values.eval())
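For readers who want to see the intermediate values, the following NumPy sketch traces the same partition-then-gather idea step by step. It is a stand-in for illustration, not the TensorFlow ops themselves, and the choice of row 2 is hypothetical (picked because that row has two non-zero entries).

```python
import numpy as np

indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
values = np.array([1.0, 1.0, 0.3, 0.7])
needed_row = 2  # hypothetical choice: the row with two non-zero entries

# 1 for entries whose row coordinate matches, 0 otherwise.
partitions = (indices[:, 0] == needed_row).astype(np.int32)

# NumPy stand-in for dynamic_partition(range(n), partitions, 2):
# split the entry positions into "drop" (0) and "keep" (1) groups.
positions = np.arange(len(indices))
parts = [positions[partitions == 0], positions[partitions == 1]]
rows_to_gather = parts[1]

# Gather the kept entries from both arrays.
slice_indices = indices[rows_to_gather]
slice_values = values[rows_to_gather]

print(partitions)      # [0 0 1 1]
print(rows_to_gather)  # [2 3]
print(slice_values)    # [0.3 0.7]
```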

Update:

The engineer also sent an example that handles multiple rows. Thanks for pointing that out.

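import tensorflow as tf  # written against the TF 1.x graph/session API

def sparse_slice(indices, values, needed_row_ids):
  # Row vector of requested ids, so it broadcasts against a column of row coordinates.
  needed_row_ids = tf.reshape(needed_row_ids, [1, -1])
  num_rows = tf.shape(indices)[0]  # number of non-zero entries
  # 1 where an entry's row coordinate matches any requested id, else 0.
  partitions = tf.cast(tf.reduce_any(tf.equal(tf.reshape(indices[:,0], [-1, 1]), needed_row_ids), 1), tf.int32)
  rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
  slice_indices = tf.gather(indices, rows_to_gather)
  slice_values = tf.gather(values, rows_to_gather)
  return slice_indices, slice_values

with tf.Session().as_default():
  indices = tf.constant([[0,0], [1, 0], [2, 0], [2, 1]])
  values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
  needed_row_ids = tf.constant([0, 2])
  # Assumed completion, mirroring the single-row example above.
  slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
  print(slice_indices.eval(), slice_values.eval())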
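One property of the any-match approach is worth spelling out: because the comparison is collapsed with an "any" reduction, the order of the requested ids and any repetitions among them do not affect the result. The NumPy sketch below illustrates this; `keep_mask` is a hypothetical helper name, and the code only imitates the broadcasting step rather than calling TensorFlow.

```python
import numpy as np

indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])

def keep_mask(indices, needed_row_ids):
    """Boolean mask over entries whose row matches any requested id."""
    rows = indices[:, 0].reshape(-1, 1)                 # [nnz, 1]
    needed = np.asarray(needed_row_ids).reshape(1, -1)  # [1, n]
    # The comparison has one element per (entry, requested id) pair: [nnz, n].
    return (rows == needed).any(axis=1)

mask_a = keep_mask(indices, [0, 2])
mask_b = keep_mask(indices, [2, 0, 2])  # reordered, with a repeated id

print(mask_a)
print(np.array_equal(mask_a, mask_b))
```

Both calls select the same entries, so a repeated row id yields that row only once.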

Let sp be the name of your 2-D SparseTensor. You can first create an indicator tensor for the rows of the SparseTensor that you want to extract, i.e.

mask = tf.concat([tf.constant([True, True]), tf.fill([sp.dense_shape[0] - 2],
    False)], axis=0)

Next, use tf.gather to propagate it to the sparse indices:

mask_sp = tf.gather(mask, sp.indices[:, 0])

Finally,

values = tf.boolean_mask(sp.values, mask_sp)
indices = tf.boolean_mask(sp.indices, mask_sp)
dense_shape = [2, sp.dense_shape[1]]  # two rows are kept, so the result has 2 rows
output_sp = tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)
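The same masking idea can be written for the first k rows rather than a hard-coded two. The NumPy sketch below is an illustration under that assumption; `first_k_rows` is a hypothetical helper, and the arrays stand in for `sp.indices`, `sp.values` and `sp.dense_shape`.

```python
import numpy as np

def first_k_rows(indices, values, dense_shape, k):
    """Keep the non-zero entries that fall in the first k dense rows."""
    row_mask = np.arange(dense_shape[0]) < k   # indicator over dense rows
    entry_mask = row_mask[indices[:, 0]]       # propagate to each non-zero entry
    out_shape = (min(k, dense_shape[0]), dense_shape[1])
    return indices[entry_mask], values[entry_mask], out_shape

indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
values = np.array([1.0, 1.0, 0.3, 0.7])
out_indices, out_values, out_shape = first_k_rows(indices, values, (3, 4), k=2)
print(out_indices, out_values, out_shape)
```

Because the kept rows are a prefix, their row coordinates are already valid in the smaller output and need no renumbering.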

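Shouldn't it behave more like this?

This version keeps the order and frequency of the indices in selected_indices, so the same row can be selected more than once: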

import tensorflow as tf
tf.enable_eager_execution()

def sparse_gather(indices, values, selected_indices, axis=0):
    """
    indices: [[idx_ax0, idx_ax1, idx_ax2, ..., idx_axk], ... []]
    values:  [ value1,                                 , ..., valuen]
    """
    mask = tf.equal(indices[:, axis][tf.newaxis, :], selected_indices[:, tf.newaxis])
    to_select = tf.where(mask)[:, 1]
    return tf.gather(indices, to_select, axis=0), tf.gather(values, to_select, axis=0)


indices = tf.constant([[1, 0], [2, 0], [3, 0], [7, 0]])
values = tf.constant([1.0, 2.0, 3.0, 7.0], dtype=tf.float32)
needed_row_ids = tf.constant([7, 3, 2, 2, 3, 7])
slice_indices, slice_values = sparse_gather(indices, values, needed_row_ids)
print(slice_indices, slice_values)
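Note that the function above returns the original row coordinates (7, 3, 2, ...). If the result is meant to be rebuilt as a sparse matrix whose i-th row corresponds to the i-th requested id, the row coordinate would probably need to be rewritten. That intended use is an assumption on my part; the NumPy sketch below, with the hypothetical name `sparse_gather_renumbered`, shows one way to do it.

```python
import numpy as np

def sparse_gather_renumbered(indices, values, selected_indices, axis=0):
    """Like sparse_gather, but rewrites the gathered axis to output positions."""
    # mask[i, j] is True when entry j lies in the i-th requested row.
    mask = indices[:, axis][np.newaxis, :] == selected_indices[:, np.newaxis]
    # np.nonzero walks the mask in row-major order, so request order is kept.
    out_pos, to_select = np.nonzero(mask)
    new_indices = indices[to_select].copy()
    new_indices[:, axis] = out_pos  # position in selected_indices, not original row
    return new_indices, values[to_select]

indices = np.array([[1, 0], [2, 0], [3, 0], [7, 0]])
values = np.array([1.0, 2.0, 3.0, 7.0])
selected = np.array([7, 3, 2, 2, 3, 7])
new_indices, new_values = sparse_gather_renumbered(indices, values, selected)
print(new_indices[:, 0])  # [0 1 2 3 4 5]
print(new_values)         # [7. 3. 2. 2. 3. 7.]
```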

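I tried "Pete Warden"'s answer, and it only works for small data. Given a SparseTensor A with m non-zero elements from which we want to take n rows, tf.equal needs m*n space, which is unacceptable for my task.

My suggestion is to use scipy.sparse instead of TensorFlow. The details are as follows:

Take all the data (indices and values) out of TensorFlow and form a scipy.sparse matrix. Use the COO format.

If you need to take out rows, use the CSR format. If you need to take out columns, use CSC.

A[:, m]

Convert back to COO.

Convert back to TensorFlow.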
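The steps above can be sketched as follows for the row-selection case. This is a minimal illustration: the arrays are hypothetical stand-ins for the indices, values and dense shape pulled out of the TensorFlow SparseTensor, and the final conversion back into TensorFlow is left out so that the example depends only on NumPy and SciPy.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the data taken out of the SparseTensor.
indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
values = np.array([1.0, 1.0, 0.3, 0.7])
dense_shape = (3, 4)

# Step 1: build a COO matrix from the coordinates and values.
coo = sparse.coo_matrix((values, (indices[:, 0], indices[:, 1])), shape=dense_shape)

# Step 2: CSR supports row selection, including repeated rows.
csr = coo.tocsr()

# Steps 3 and 4: pick rows 2, 0, 2 and convert the result back to COO.
picked = csr[[2, 0, 2], :].tocoo()

# Step 5 (not shown): these arrays are what would be handed back to TensorFlow.
new_indices = np.stack([picked.row, picked.col], axis=1)
new_values = picked.data
new_shape = picked.shape
print(picked.toarray())
```

The row coordinates in `new_indices` are already renumbered to the positions in the selection, since the selection produces a new 3 x 4 matrix.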

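Oh, I see. It cleverly combines dynamic_partition and gather. The idea is not to form a SparseTensor at all. Instead, just run dynamic_partition and gather on the dense indices and values arrays. This is very helpful!

By the way, this code seems to work for only one row. How do I extract multiple rows, possibly with duplicates? What if I want to extract the first, third, and third rows? That corresponds to an input sentence in which some words occur more than once. I know I can run sparse_slice several times and join the results with tf.concat, but is there a better way?

I finally figured it out. I can do a similar partition on the dense indices and values. I'll post a solution tomorrow that makes heavy use of tf.gather.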