Python word embeddings: dimensions of the word-ID list


python tensorflow nlp


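I want to perform word embedding using pretrained GloVe embeddings that I have downloaded.

I am using a six-word sentence as a test and have set the maximum document length to 30. I use a learn.preprocessing.VocabularyProcessor() object to learn the token-ID dictionary, and I use this object's transform() method to convert the input sentence into a list of word IDs so that I can look them up in the embedding matrix.

Why does VocabularyProcessor.transform() return a 6 x 30 array? I expected it to return only a list of IDs, one for each word in the test sentence.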

    #show vocab and embedding
    print('vocab size:%d\n' % vocab_size)
    print('embedding dim:%d\n' %embedding_dim)

    #test input
    test_input_sentence="the cat sat on the mat"
    test_words_list=test_input_sentence.split()
    print (test_words_list)

    #create embedding matrix W, and define a placeholder to be fed
    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]), trainable=False, name="W")
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    print('initalised embedding')
    print(embedding_init.get_shape())

    with tf.Session() as sess:
        sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

        #init a vocab processor object
        vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

        #fit = Learn a vocabulary dictionary of all tokens in the raw documents.
        pretrain = vocab_processor.fit(vocab)
        print('vocab preprocessor done')

        #transform input to word-id matrix.
        x = np.array(list(vocab_processor.transform(test_words_list)))
        print('word id list shape:')
        print (x.shape)
        print('embedding tensor shape:')
        print(W.get_shape())

        vec=tf.nn.embedding_lookup(W,x)
        print ('vectors shape:')
        print (vec.get_shape())
        print ('embeddings:')
        print (sess.run(vec))

According to the comment in the code of the transform() function:

    """Transforms input documents into sequence of ids.

    Args:
      X: iterator or list of input documents.
        Documents can be bytes or unicode strings, which will be encoded as
        utf-8 to map to bytes. Note, in Python2 str and bytes is the same type.

    Returns:
      iterator of byte ids.
    """
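
You are passing a list of tokens, but the function expects a list of documents. Each word in your list is therefore treated as a separate document and padded to the maximum document length of 30, which is why the result has shape 6 x 30.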
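
The shape difference can be reproduced without TensorFlow. The sketch below is only an illustration: `transform` here is a hypothetical stand-in that mimics the "one padded row of IDs per document" behaviour described above, and the toy `vocab` dictionary is an assumption, not the GloVe vocabulary from the question. In the original code, the corresponding change would presumably be to pass the whole sentence wrapped in a list, i.e. `vocab_processor.transform([test_input_sentence])`, instead of `test_words_list`.

```python
import numpy as np


def transform(documents, vocab, max_document_length):
    """Hypothetical stand-in for VocabularyProcessor.transform().

    Returns one row of word IDs per document, padded with 0 up to
    max_document_length. Unknown tokens also map to 0.
    """
    rows = []
    for doc in documents:
        ids = np.zeros(max_document_length, dtype=np.int64)
        for i, token in enumerate(doc.split()[:max_document_length]):
            ids[i] = vocab.get(token, 0)
        rows.append(ids)
    return np.array(rows)


sentence = "the cat sat on the mat"

# Toy vocabulary (assumption): IDs start at 1 so that 0 can mean padding.
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())), start=1)}

# Passing the token list: every word is treated as its own document.
per_word = transform(sentence.split(), vocab, 30)

# Passing a list with one sentence: a single document of six words.
per_sentence = transform([sentence], vocab, 30)

print(per_word.shape)      # (6, 30)
print(per_sentence.shape)  # (1, 30)

# Drop the padding to get just the IDs of the six words.
word_ids = per_sentence[0, :len(sentence.split())]
print(word_ids)
```

With the single-document form, the first row holds the six word IDs followed by padding, so slicing off the padding (or feeding the whole row to `tf.nn.embedding_lookup`) gives one ID per word as originally expected.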