CTF reader in CNTK raises an error on large files


c++ nlp deep-learning


Following the CNTK tutorial on GitHub, I used a CTF reader function:

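from cntk.io import (MinibatchSource, CTFDeserializer, StreamDef, StreamDefs,
                     INFINITELY_REPEAT, FULL_DATA_SWEEP)

def create_reader(path, is_training, input_dim, label_dim):
    # Sparse features are read from field 'x', dense labels from field 'y'.
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        features=StreamDef(field='x', shape=input_dim, is_sparse=True),
        labels=StreamDef(field='y', shape=label_dim, is_sparse=False)
    )), randomize=is_training, epoch_size=INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)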

This works fine until the input file grows beyond some size (I don't know exactly what size). Past that point it throws an error like this:

WARNING: Sparse index value (269) at offset 8923303 in the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt) exceeds the maximum expected value (268).
attempt: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt)., retrying 2-th time out of 5...
.
.
.

RuntimeError: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt).

I found that this error is thrown in TextParser.cpp.

How can I solve this problem?

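You need to know the dimension of your input, and you also need to know that indices start at 0. So if you created an input file that maps your vocabulary to the range 1 to 20000, the dimension is 20001.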
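In the warning above, a maximum expected index of 268 corresponds to a declared dimension of 269, while the offending index 269 needs a dimension of at least 270. One way to check this before training is to scan the file for the largest sparse index. The following is only a minimal sketch: the helper names max_sparse_index and required_dim are made up for illustration, and it assumes the usual CTF layout in which each stream is written as |name followed by index:value pairs.

```python
def max_sparse_index(lines, field="x"):
    """Return the largest sparse index seen for `field`, or -1 if none is found."""
    largest = -1
    for line in lines:
        # Text before the first '|' is an optional sequence id; skip it.
        for chunk in line.split("|")[1:]:
            tokens = chunk.split()
            if not tokens or tokens[0] != field:
                continue  # a different stream (or a comment chunk)
            for token in tokens[1:]:
                index, sep, _value = token.partition(":")
                if sep:  # only index:value pairs carry a sparse index
                    largest = max(largest, int(index))
    return largest


def required_dim(lines, field="x"):
    """Indices start at 0, so the dimension must be the largest index plus one."""
    return max_sparse_index(lines, field) + 1


# Two hypothetical CTF lines, for illustration only.
sample = [
    "0 |x 3:1 268:1 |y 1 0",
    "1 |x 269:1 |y 0 1",
]
print(required_dim(sample))  # 270
```

To run it on a real file, pass an open file object instead of the sample list, for example required_dim(open(path)), and compare the result with the input_dim given to create_reader.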

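It sounds very much like your input file is malformed. Look at the input file around the offset given in the error message: 8923303. For example, you can skip the first 8923000 bytes of the file with "tail -c +8923000 file.tsv".

I am creating the input file with the same logic every time. It only fails once the input gets large. I don't understand: if my logic were wrong, how could it run successfully on smaller files? For example, it works for every input size up to around 25K sequences. Beyond that it starts failing.

That is possible if it is an off-by-one error and a rare word is responsible. Do you still get the same error if you define the input dimension to be larger than its current value?

That was indeed the case. The input dimension was not being updated correctly. Thanks.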
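On Windows, where tail is usually not available, the bytes around the offset reported in the warning (8923303) can also be inspected from Python. This is a rough sketch; context_at_offset is a hypothetical helper name, and the window sizes are arbitrary defaults.

```python
def context_at_offset(path, offset, before=200, after=200):
    """Return the text surrounding a byte offset in the file at `path`."""
    start = max(0, offset - before)
    with open(path, "rb") as f:
        # Seek in binary mode so the offset is counted in bytes,
        # the same way the reader's warning reports it.
        f.seek(start)
        data = f.read((offset - start) + after)
    # Replace undecodable bytes instead of failing on a cut-off character.
    return data.decode("utf-8", errors="replace")
```

Calling context_at_offset(path, 8923303) with the path of the CTF file should show the line containing the out-of-range index, which makes it easier to see which word produced it.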