Stanford NLP: what do the overflow_xxxx.bin files mean during training?


I am training a word-embedding model based on GloVe, and the program prints a log like the following:

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The GloVe home directory is filling up with files called overflow_0534.bin. Can someone tell me whether everything is running OK?

Thanks.

Everything is OK.

You can check the source code of the GloVe cooccur program.

At line 57 of that file:

long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus has too many cooccurrence records, some of them will be written out to temporary .bin files on disk:

while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
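
Each time the buffer fills, cooccur closes the current temp file and opens the next one, so a file like overflow_0534.bin is simply the 534th flushed chunk. Here is a minimal sketch (not the actual GloVe code) of what the sprintf pattern above generates, assuming the file prefix (file_head) is "overflow", which matches the file names in the question:

#include <stdio.h>

int main(void) {
    const char *file_head = "overflow";  /* assumed prefix, matching the file names in the question */
    char filename[1024];
    int fidcounter;

    /* Reproduce the "%s_%04d.bin" naming used in the loop above */
    for (fidcounter = 1; fidcounter <= 3; fidcounter++) {
        sprintf(filename, "%s_%04d.bin", file_head, fidcounter);
        printf("%s\n", filename);        /* overflow_0001.bin, overflow_0002.bin, overflow_0003.bin */
    }
    return 0;
}

These are intermediate chunks that get merged into the final cooccurrence.bin output at the end of the run, so seeing hundreds of them only means the in-memory buffer has been flushed that many times.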
The variable overflow_length depends on the memory setting:

Line 463:

if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:

rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:

overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
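
Putting those three lines together for the run in the question: with -memory 4.0, and assuming GloVe's CREC record is two ints plus one double (16 bytes), the formula reproduces the "overflow length: 38028356" reported in your log. A minimal sketch of the arithmetic:

#include <stdio.h>

typedef double real;

typedef struct cooccur_rec {   /* assumed CREC layout: 4 + 4 + 8 = 16 bytes */
    int word1;
    int word2;
    real val;
} CREC;

int main(void) {
    real memory_limit = 4.0;   /* the value passed with -memory 4.0 */
    real rlimit = 0.85 * (real)memory_limit * 1073741824 / (sizeof(CREC));
    long long overflow_length = (long long) rlimit / 6;
    printf("overflow_length = %lld\n", overflow_length);   /* prints 38028356 when sizeof(CREC) == 16 */
    return 0;
}

Since overflow_length scales linearly with -memory, raising -memory makes each temporary chunk larger, so fewer of them are written.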

Thanks for the reply. So how can I keep this huge number of files from preventing me from training a model with >= 300 dimensions?

@Nacho the overflow_xxx.bin files are cache files, so you can delete them once cooccurrence.bin has been generated. If you want to avoid these files, you probably need more RAM.
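
As a rough illustration of why more RAM helps (the spilled-record count below is invented for illustration, not measured from this corpus, and a 16-byte CREC is assumed): the number of overflow_xxxx.bin chunks is roughly the number of spilled cooccurrence records divided by overflow_length, and overflow_length grows linearly with -memory.

#include <stdio.h>

int main(void) {
    long long spilled = 20000000000LL;   /* hypothetical number of spilled cooccurrence records */
    double memory[] = {4.0, 8.0, 16.0};  /* candidate -memory settings, in GB */
    int i;

    for (i = 0; i < 3; i++) {
        double rlimit = 0.85 * memory[i] * 1073741824 / 16.0;   /* assumes 16-byte CREC */
        long long overflow_length = (long long) rlimit / 6;
        /* each chunk holds overflow_length records, so this is the approximate file count */
        printf("-memory %.1f  ->  overflow_length %lld  ->  ~%lld overflow chunks\n",
               memory[i], overflow_length, spilled / overflow_length + 1);
    }
    return 0;
}

Doubling -memory roughly halves the number of overflow_xxxx.bin files, which is why the suggestion above is simply to give cooccur more RAM.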