Python: vocabulary size becomes smaller after using tf.contrib.learn.preprocessing.VocabularyProcessor

First, I trained the word embeddings with the following code, and I don't think there is anything wrong with that step. I then created a list vocab that stores the words from the resulting vector file.

#!/usr/bin/env python
# -*- coding: utf-8  -*-    
import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  

import logging
import os.path
import sys
import multiprocessing

# from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':    
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    min_count = 100
    data_dir = '/opt/mengyuguang/word2vec/'
    inp = data_dir + 'wiki.zh.simp.seg.txt'
    outp1 = data_dir + 'wiki.zh.min_count{}.model'.format(min_count)
    outp2 = data_dir + 'wiki.zh.min_count{}.vector'.format(min_count)

    # train cbow
    model = Word2Vec(LineSentence(inp), size=300,
                     workers=multiprocessing.cpu_count(), min_count=min_count)

    # save
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
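
Before comparing counts, one way to cross-check how many words the model actually kept is to read the number from the trained model itself rather than from the text file. This is a minimal sketch, assuming the gensim 3.x-style API used above (where the vocabulary is exposed as model.wv.vocab) and reusing outp1 from the script:

from gensim.models import Word2Vec

model = Word2Vec.load(outp1)   # reload the model saved above
print(len(model.wv.vocab))     # number of words kept after min_count filtering

# Note: the text file written by model.wv.save_word2vec_format() begins with a
# "<vocab_size> <vector_dim>" header line, so code that collects words line by
# line should skip that first line.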
vocab is a list containing 415,657 words, but the vocabulary size I get is 412,722. I know that vocab_processor.fit does not count the uppercase and lowercase forms of a word as two separate words. This is really strange; how did this happen? I checked the vector file again, and there are no duplicate words in it at all.

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length)
pretrain = vocab_processor.fit(vocab)   # vocab is the word list read from the vector file
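
To see what fit actually keeps from each entry, a quick check on a handful of made-up strings can help. This is only a sketch, assuming the same tf.contrib.learn API as above; the sample entries are invented:

from tensorflow.contrib import learn

# a few invented entries: ordinary words plus punctuation-only tokens
samples = ['apple', 'word2vec', '北京', ',', '《']

checker = learn.preprocessing.VocabularyProcessor(max_document_length=5)
checker.fit(samples)

# index 0 is reserved for the unknown token, so real entries start at 1
print(len(checker.vocabulary_))
print(dict(checker.vocabulary_._mapping))   # which tokens survived fit's tokenizer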
I checked the vocabulary with the code below. The output after it shows every entry of list1 that is missing from the processor's vocabulary; this is what I got:

import codecs
import numpy as np
from tensorflow.contrib import learn

# collect the first token (the word) of every line in the vector file
f = codecs.open('wiki.min_count10_char.vector', 'r', 'utf-8')
list1 = []
for i in f:
    list1.append(i.strip().split()[0])

vocab_processor = learn.preprocessing.VocabularyProcessor(5)
x = np.array(list(vocab_processor.fit_transform(list1)))
vocab_size = len(vocab_processor.vocabulary_)
print(len(list1), vocab_size, vocab_processor.vocabulary_)

# sort the learned vocabulary by index and print every entry of list1
# that the processor did not keep
vocab_dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])
vocabulary = list(list(zip(*sorted_vocab))[0])
for i in list1:
    if i not in vocabulary:
        print(i, end='')

,=.*、():《圣经》:《圣经》:《圣经》:《圣经》/《圣经》-《圣经》;"}{・%–/『』~!+〈〉…?→!─&〜~‧×※.°$’%>‘■☆?℃•〔〕=−@[]―︰́○​←★′□†་|﹕█&+﹑>↔●﹐\◇「<♪」"・*»्ा^±«ั♥∞‰‎်­ิ⇔≤£﹞﹝#‐″®━∈ี`│◎่ོ⇒

You should be able to build a tiny corpus, along the lines of the code you already have, that demonstrates the difference in the surviving vocabulary. By looking at exactly which words appear on one path but not on the other, it should become clear how VocabularyProcessor differs from gensim's LineSentence in the way it tokenizes.
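
For example, something along these lines (a rough sketch, assuming TensorFlow 1.x with tf.contrib available; the sample sentences are invented) prints which tokens each path keeps:

from tensorflow.contrib import learn

# a tiny invented corpus: pre-segmented text with standalone punctuation marks
corpus = ['我 爱 北京 。', 'word2vec 很 好用 ，', '《 圣经 》 是 一本 书']

# path 1: gensim's LineSentence simply splits each line on whitespace,
# so punctuation that stands alone becomes its own "word"
gensim_tokens = set()
for line in corpus:
    gensim_tokens.update(line.split())

# path 2: VocabularyProcessor runs its own tokenizer over each string
vp = learn.preprocessing.VocabularyProcessor(max_document_length=10)
vp.fit(corpus)
vp_tokens = set(vp.vocabulary_._mapping) - {'<UNK>'}   # drop the reserved unknown token

print('kept only by the whitespace split:', gensim_tokens - vp_tokens)
print('kept only by VocabularyProcessor:', vp_tokens - gensim_tokens)

If I remember the signature correctly, VocabularyProcessor also accepts a tokenizer_fn argument, so if the goal is to keep every whitespace-separated entry exactly as it appears, a pass-through tokenizer can be supplied in place of the default one.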