TensorFlow: how to use the TF CTC loss with variable-length features and labels

I want to implement a speech recognizer with a CTC loss using TensorFlow. The input features have variable length because each utterance can have a different duration. The labels also have variable length because each transcription is different. I manually pad the features to create batches, and in my model I have a tf.keras.layers.Masking() layer to create the mask and propagate it through the network. I also create the label batch using padding.

Here is a dummy example. Suppose I have two utterances of 3 and 5 frames respectively. Each frame is represented by a single feature (normally this would be 13 MFCCs, but I reduce it to one to keep things simple). So, to create the batch, I pad the short utterance with 0 at the end:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])
The labels are the transcriptions of these utterances. Suppose their lengths are 2 and 3 respectively. The label batch would have shape [2, 3, 26], where 2 is the batch size, 3 is the maximum length, and 26 is the number of English characters (one-hot encoded).
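As a sketch of what that label batch could look like (the transcriptions here are made up for illustration, assuming 'a'..'z' map to indices 0..25 and padding rows stay all-zero):

import numpy as np

# Hypothetical transcriptions of lengths 2 and 3.
transcripts = ["hi", "cat"]
max_len, num_chars = 3, 26

labels = np.zeros((len(transcripts), max_len, num_chars), dtype=np.float32)
for i, word in enumerate(transcripts):
    for t, ch in enumerate(word):
        labels[i, t, ord(ch) - ord('a')] = 1.0  # one-hot encode each character

print(labels.shape)  # (2, 3, 26): batch of 2, max length 3, 26 letters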

The model is:

input_ = tf.keras.Input(shape=(None, 1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(26, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_, output_)
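A quick sanity check (my sketch, not part of the original post): feed the padded batch and inspect the mask attached to the output. On TF 2.2+ the Softmax layer propagates the mask, while on older versions this prints None:

y = model(features.reshape(2, 5, 1))        # reuse the padded features above
print(getattr(y, '_keras_mask', None))      # boolean mask of shape (2, 5), or None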
And the loss function is something like:

def ctc_loss(y_true, y_pred):
   # Do something here to get logit_length and label_length?
   # ...
   loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
   return loss
My question is how to obtain logit_length and label_length. I assume logit_length is encoded in the mask, but if I do y_pred._keras_mask the result is None. For label_length, the information is in the tensor itself, but I am not sure of the most efficient way to get it.

Thanks.

Update:

Following Tou You's answer, I now use tf.math.count_nonzero to get label_length, and I set logit_length to the length of the logit layer.
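Concretely (my reconstruction of the description above, assuming zero-padded integer labels), the loss now looks something like:

import tensorflow as tf

def ctc_loss(y_true, y_pred):
    # Label length: number of non-padding (non-zero) labels per utterance.
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    # Logit length: number of unmasked time frames at the logit layer.
    logit_length = tf.reduce_sum(
        tf.cast(y_pred._keras_mask, tf.int32), axis=1, keepdims=True)
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)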

So the shapes inside the loss function are (with a batch size of 10): y_true is (10, None) and y_pred is (10, None, 27).

Of course, the "None" of y_true and y_pred is not the same, since one is the maximum string length of the batch and the other is the maximum number of time frames of the batch. However, when I call model.fit() with these arguments and the loss tf.keras.backend.ctc_batch_cost(), I get the error:

Traceback (most recent call last):
  File "train.py", line 164, in <module>
    model.fit(dataset, batch_size=batch_size, epochs=10)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [10,92] vs. [10,876]
         [[node Equal (defined at train.py:164) ]]
  (1) Invalid argument:  Incompatible shapes: [10,92] vs. [10,876]
         [[node Equal (defined at train.py:164) ]]
         [[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]

Function call stack:
train_function -> train_function

It seems to complain that the length of y_true (92) is not the same as the length of y_pred (876), which I thought should not matter. What am I missing?
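One possible workaround, sketched here as an assumption rather than a verified fix: tf.nn.ctc_loss accepts dense, zero-padded labels together with explicit per-example lengths, so the label and logit tensors do not need matching time dimensions. Note that it expects logits rather than probabilities, so the softmax output is converted with a log here:

def ctc_loss_v2(y_true, y_pred):
    # Assumes y_true holds dense integer labels padded with zeros and
    # y_pred holds per-frame probabilities from the Softmax layer.
    label_length = tf.cast(tf.math.count_nonzero(y_true, axis=-1), tf.int32)
    logit_length = tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.int32), axis=1)
    loss = tf.nn.ctc_loss(
        labels=tf.cast(y_true, tf.int32),
        logits=tf.math.log(y_pred + 1e-8),  # log-probabilities used as logits
        label_length=label_length,
        logit_length=logit_length,
        logits_time_major=False,
        blank_index=26)  # assuming indices 0-25 are letters and 26 is blank
    return tf.reduce_mean(loss)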

At least for recent versions of TensorFlow (2.2 and above), the Softmax layer supports masking, and the output for the masked values is not zero; it simply repeats the previous value:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]]).reshape(2, 5, 1)

input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)

x = tf.keras.layers.GRU(2, return_sequences=True)(x)

output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_,output_)

r = model(features)
print(r)
 
The output for the first sample contains repeated values that correspond to the mask:

<tf.Tensor: shape=(2, 5, 2), dtype=float32, numpy=
array([[[0.53308547, 0.46691453],
        [0.5477166 , 0.45228338],
        [0.55216545, 0.44783455],
        [0.55216545, 0.44783455],
        [0.55216545, 0.44783455]],

       [[0.532052  , 0.46794805],
        [0.54557794, 0.454422  ],
        [0.55263203, 0.44736794],
        [0.56076777, 0.4392322 ],
        [0.5722393 , 0.42776066]]], dtype=float32)>
You can extract the sequence lengths from the value of the mask tensor:

<tf.Tensor: shape=(2, 5), dtype=bool, numpy=
array([[ True,  True,  True, False, False],
       [ True,  True,  True,  True,  True]])>
For the value of logit_length, every implementation I have seen simply returns the number of time steps, so logit_length can be:

logit_length = tf.ones(shape=(your_batch_size, 1)) * time_step
Or you can use the mask tensor to count only the unmasked time steps:

logit_length = tf.reshape(tf.reduce_sum(
    tf.cast(y_pred._keras_mask, tf.float32), axis=1), (your_batch_size, -1))
Here is a complete example:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.5, 2.3, 4.6, 2.0, 1.0]]).reshape(2, 5, 1)
labels = np.array([[1., 2., 3., 0., 0.],
                   [1., 2., 3., 2., 1.]]).reshape(2, 5)

input_ = tf.keras.Input(shape=(5, 1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(5, return_sequences=True)(x)  # 5 is the number of classes + blank (in your case 26 + 1)
output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_, output_)


def ctc_loss(y_true, y_pred):
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    logit_length = tf.reshape(tf.reduce_sum(
        tf.cast(y_pred._keras_mask, tf.float32), axis=1), (2, -1))

    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return tf.reduce_mean(loss)


model.compile(loss=ctc_loss, optimizer='adam')
model.fit(features, labels, epochs=10)
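Once a model like this is trained, decoding could look roughly like the following (a sketch reusing the objects above; tf.keras.backend.ctc_decode takes the softmax output and the per-example frame counts):

pred = model(features)
frame_lengths = tf.reduce_sum(tf.cast(pred._keras_mask, tf.int32), axis=1)
decoded, log_probs = tf.keras.backend.ctc_decode(pred, frame_lengths, greedy=True)
print(decoded[0])  # best label sequence per sample; -1 marks padding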

If you pad the label tensor with zeros, you can get the length by counting the values that differ from zero: label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True). Do not use the output of a layer that does not carry the mask as the input to the next layer!

I see; then what about logit_length? I also pad the features with zeros, but after they pass through the network they are no longer zero, so I cannot do the same there.

For the masked values the output will not be zero, but you will see that the output just repeats without any change. The output layer (softmax) must equal 26 + 1 (number of English characters + blank).

Thanks, I followed this and ran into another problem; I have updated the main question.