Python SpeechBrain: dataio_prepare function with CSV files

I am currently following the ASR from Scratch tutorial, but I am struggling to make it work with the Fluent Speech Commands dataset. I got through the tokenizer part and the language model part without any problems, but I am stuck on the speech recognizer part. I modified the dataio_prepare function, and I am not sure it is correct:
"""This function prepares the datasets to be used in the brain class.
It also defines the data processing pipeline through user-defined functions.
Arguments
---------
hparams : dict
This dictionary is loaded from the `train.yaml` file, and it includes
all the hyperparameters needed for dataset construction and loading.
Returns
-------
datasets : dict
Dictionary containing "train", "valid", and "test" keys that correspond
to the DynamicItemDataset objects.
"""
# Define audio pipeline. In this case, we simply read the path contained
# in the variable wav with the audio reader.
@sb.utils.data_pipeline.takes("path")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(path):
"""Load the audio signal. This is done on the CPU in the `collate_fn`."""
sig = sb.dataio.dataio.read_audio('../fluent_speech_commands_dataset/' + path)
return sig
# Define text processing pipeline. We start from the raw text and then
# encode it using the tokenizer. The tokens with BOS are used for feeding
# decoder during training, the tokens with EOS for computing the cost function.
# The tokens without BOS or EOS is for computing CTC loss.
@sb.utils.data_pipeline.takes("transcription")
@sb.utils.data_pipeline.provides(
"words", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
)
def text_pipeline(transcription):
"""Processes the transcriptions to generate proper labels"""
yield transcription
tokens_list = hparams["tokenizer"].encode_as_ids(transcription)
yield tokens_list
tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
yield tokens_bos
tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
yield tokens_eos
tokens = torch.LongTensor(tokens_list)
yield tokens
# Define datasets from json data manifest file
# Define datasets sorted by ascending lengths for efficiency
datasets = {}
data_folder = hparams["data_folder"]
for dataset in ["train", "valid", "test"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_csv(
csv_path = hparams[f"{dataset}_annotation"],
replacements={"data_root": data_folder},
dynamic_items=[audio_pipeline, text_pipeline],
output_keys=[
"id",
"sig",
"words",
"tokens_bos",
"tokens_eos",
"tokens",
],
)
hparams[f"{dataset}_dataloader_opts"]["shuffle"] = False
# Sorting training data with ascending order makes the code much
# faster because we minimize zero-padding. In most of the cases, this
# does not harm the performance.
if hparams["sorting"] == "ascending":
datasets["train"] = datasets["train"].filtered_sorted(sort_key="length")
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "descending":
datasets["train"] = datasets["train"].filtered_sorted(
sort_key="length", reverse=True
)
hparams["train_dataloader_opts"]["shuffle"] = False
elif hparams["sorting"] == "random":
hparams["train_dataloader_opts"]["shuffle"] = True
pass
else:
raise NotImplementedError(
"sorting must be random, ascending or descending"
)
return datasets
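To check that the pipelines do what I expect, I inspect the first training example directly (a minimal sketch, assuming train.yaml loads with hyperpyyaml; indexing a DynamicItemDataset computes the requested output_keys for that single item, with no batching or padding involved):

from hyperpyyaml import load_hyperpyyaml

# Load the same hparams file used for training (this also instantiates
# the model objects, so it is heavier than a plain YAML load).
with open("train.yaml") as f:
    hparams = load_hyperpyyaml(f)

datasets = dataio_prepare(hparams)

# Indexing runs both pipelines for item 0 only.
first = datasets["train"][0]
print(first["sig"].shape)   # 1-D waveform tensor
print(first["words"])       # raw transcription string
print(first["tokens_bos"])  # token ids with bos_index prepended
print(first["tokens_eos"])  # token ids with eos_index appended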
To clarify, the .csv files look like this:
ID,path,speakerId,transcription,action,object,location
0,wavs/speakers/2BqVo8kVB2Skwgyb/0a3129c0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Change language,change language,none,none
1,wavs/speakers/2BqVo8kVB2Skwgyb/0ee42a80-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Resume,activate,music,none
2,wavs/speakers/2BqVo8kVB2Skwgyb/144d5be0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Turn the lights on,activate,lights,none
3,wavs/speakers/2BqVo8kVB2Skwgyb/1811b6e0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Switch on the lights,activate,lights,none
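As a sanity check on the manifest (a minimal sketch, not from the tutorial, assuming the soundfile package and the CSV layout above), every path should resolve relative to the dataset folder and no clip should be empty:

import csv
import os
import soundfile as sf

DATA_ROOT = "../fluent_speech_commands_dataset"

with open(os.path.join(DATA_ROOT, "data/train_data.csv")) as f:
    for row in csv.DictReader(f):
        wav = os.path.join(DATA_ROOT, row["path"])
        assert os.path.isfile(wav), f"missing file: {wav}"
        # A zero-length clip would later become an empty "sig" tensor.
        assert sf.info(wav).frames > 0, f"empty audio: {wav}"

And here is my train.yaml: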
# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN
# Decoder: GRU + beamsearch + RNNLM
# Tokens: 500 BPE
# losses: CTC + NLL
# Training: mini-librispeech
# Pre-Training: librispeech 960h
# Authors: Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga, Samuele Cornell 2020
# ############################################################################
# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 42
__set_seed: !apply:torch.manual_seed [!ref <seed>]
# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.
data_folder: ../fluent_speech_commands_dataset # In this case, data will be automatically downloaded here.
data_folder_rirs: ../noise # noise/RIR dataset will automatically be downloaded here
output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead, e.g. if you want to use your own LM / tokenizer.
pretrained_path: ../language_model/results/RNNLM/save/CKPT+2021-05-12+15-27-08+00/
# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../fluent_speech_commands_dataset/data/train_data.csv
valid_annotation: ../fluent_speech_commands_dataset/data/valid_data.csv
test_annotation: ../fluent_speech_commands_dataset/data/test_data.csv
# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: random
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1
# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
valid_dataloader_opts:
    batch_size: !ref <batch_size>
test_dataloader_opts:
    batch_size: !ref <batch_size>
# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40
# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 500 # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0
# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25
# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>
# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder_rirs>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
# Adds speed change + time and frequency dropouts (time-domain implementation).
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]
# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False
# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>
# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>
# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>
# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True
# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>
# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor
# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    env_corrupt: !ref <env_corrupt>
    lm_model: !ref <lm_model>
# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>
# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference
# Beamsearch is applied on the top of the decoder. If the language model is
# given, a language model is applied (with a weight specified in lm_weight).
# If ctc_weight is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding. For a description of
# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearchLM.
# It makes sense to have a lighter search during validation. In this case,
# we don't use the LM and CTC probabilities during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    temperature: !ref <temperature>
# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well.
# Please, remove this part if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    ctc_weight: !ref <ctc_weight_decode>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>
# This function manages learning rate annealing over the epochs.
# Here we use the NewBob algorithm, which anneals the learning rate if
# the improvement over two consecutive epochs is smaller than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0
# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8
# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True
# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>
# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
#pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
#    collect_in: !ref <save_folder>
#    loadables:
#        lm: !ref <lm_model>
#        tokenizer: !ref <tokenizer>
#        model: !ref <model>
#    paths:
#        lm: !ref <pretrained_path>/lm.ckpt
#        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
#        model: !ref <pretrained_path>/asr.ckpt
Also, I removed the lines corresponding to the pretraining stage, because I don't know how to make them work with my own dataset:
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])
My problem is that the model fitting stage crashes on the very first batch it tries to process, and I don't know how to fix it:
(Polette) aurelienmarchal@aurelienmarchal-X556UQ:~/Stage/Polette/speech_recognizer$ python3 train.py train.yaml --batch_size=2
../noise/rirs_noises.zip exists. Skipping download
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/CRDNN_BPE_960h_LM/42
speechbrain.core - Info: ckpt_interval_minutes arg from hparam file is used
speechbrain.core - 171.8M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
0%| | 0/11566 [00:00<?, ?it/s]
speechbrain.core - Exception:
Traceback (most recent call last):
File "train.py", line 452, in <module>
asr_brain.fit(
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/core.py", line 1011, in fit
for batch in t:
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/dataio/batch.py", line 125, in __init__
padded = PaddedData(*padding_func(values, **padding_kwargs))
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/utils/data_utils.py", line 415, in batch_pad_right
padded, valid_percent = pad_right_to(
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/utils/data_utils.py", line 353, in pad_right_to
valid_vals.append(tensor.shape[j] / target_shape[j])
ZeroDivisionError: division by zero
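The ZeroDivisionError comes from pad_right_to dividing by target_shape[j], so one dimension of the batch's target shape is 0, which suggests that some item yields a zero-length tensor. To locate it, I would iterate over the dataset directly, before any batching (a minimal debugging sketch; datasets comes from the dataio_prepare function above):

for i in range(len(datasets["train"])):
    item = datasets["train"][i]
    # An empty waveform or an empty token list gives the collate_fn a
    # tensor with a zero-sized dimension to pad against.
    if item["sig"].numel() == 0 or item["tokens"].numel() == 0:
        print("zero-length item:", item["id"])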