TensorFlow VocabularyProcessor


I am following the WildML blog post on text classification with TensorFlow. I cannot understand the purpose of max_document_length in this statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

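Also, how can I extract the vocabulary from vocab_processor?

I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me: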

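# Requires TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.0.
import numpy as np
from tensorflow.contrib import learn

x_text = ['This is a cat', 'This must be boy', 'This is a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))

## Extract the word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping

## Sort the vocabulary dictionary by value (id).
## Both statements perform the same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

## Treat the ids as indices into a list and create a list of words in ascending order of id:
## the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])

print(vocabulary)
print(x)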
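The sorting step can be tried on its own, without TensorFlow installed. The sketch below uses a hand-written dictionary as a stand-in for the processor's mapping. The words and ids are illustrative assumptions, not output captured from VocabularyProcessor. It shows how sorting by id produces a list in which the word with id i sits at index i.

```python
# Hypothetical word:id mapping standing in for the one held by the processor.
vocab_dict = {"is": 2, "<UNK>": 0, "cat": 4, "This": 1, "a": 3}

# Sort the (word, id) pairs by id, then keep only the words.
sorted_vocab = sorted(vocab_dict.items(), key=lambda pair: pair[1])
vocabulary = [word for word, _ in sorted_vocab]

# The word with id i is now at index i of the list.
assert all(vocab_dict[word] == i for i, word in enumerate(vocabulary))
print(vocabulary)  # ['<UNK>', 'This', 'is', 'a', 'cat']
```

This only works as a reverse lookup table when the ids are contiguous and start at 0, which is what the sketch assumes.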

not able to understand the purpose of max_document_length

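VocabularyProcessor maps your text documents into vectors, and these vectors need to have a consistent length.

Your input records may not (in fact, probably won't) all be the same length. For example, if you are working with sentences for sentiment analysis, they will be of various lengths.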

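You provide this parameter to VocabularyProcessor so that it can adjust the length of the output vectors. According to the documentation:

max_document_length: Maximum length of documents. If documents are longer, they will be trimmed; if shorter, they will be padded.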
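The trim-or-pad rule can be illustrated in plain Python. This is a minimal sketch, not the library's implementation. The helper name to_fixed_length and the sample id lists are hypothetical, and padding with 0 is chosen to match the np.zeros initialization in the transform source quoted in this answer.

```python
def to_fixed_length(word_ids, max_document_length, pad_id=0):
    """Trim or right-pad a list of word ids to exactly max_document_length."""
    trimmed = word_ids[:max_document_length]
    return trimmed + [pad_id] * (max_document_length - len(trimmed))

short_doc = [1, 2, 3]           # 3 tokens: shorter than the limit, gets padded
long_doc = [1, 5, 6, 7, 8, 9]   # 6 tokens: longer than the limit, gets trimmed

print(to_fixed_length(short_doc, 4))  # [1, 2, 3, 0]
print(to_fixed_length(long_doc, 4))   # [1, 5, 6, 7]
```

Every document comes out with the same length, so the results can be stacked into a single [n_samples, max_document_length] matrix.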

Check out the source code.

Note the line

word_ids = np.zeros(self.max_document_length)

Each row in the raw_documents variable will be mapped to a vector of length max_document_length.

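I am trying to follow the same tutorial, but there are a few things I don't understand. Maybe you can help me? If you look at the vocabulary, you can see that "This" has index 1, "is" has index 2, and so on. I would like to supply my own indices, for example frequency-based ones. Do you know how to do that?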
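One way to get frequency-based indices is to build the word:id mapping yourself rather than relying on the order in which the processor first sees each word. The sketch below is plain Python and does not use the VocabularyProcessor API. The helper names build_frequency_vocab and encode are hypothetical, the sample sentences are assumed, and id 0 is reserved for padding and unknown words as an assumption.

```python
from collections import Counter

def build_frequency_vocab(documents):
    """Assign ids by descending word frequency; id 0 is left for padding/unknown."""
    counts = Counter(token for doc in documents for token in doc.split())
    # Break ties alphabetically so the ordering is deterministic.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def encode(document, vocab, max_document_length):
    """Map tokens to ids, trim to max_document_length, and pad with 0."""
    ids = [vocab.get(token, 0) for token in document.split()][:max_document_length]
    return ids + [0] * (max_document_length - len(ids))

x_text = ["This is a cat", "This must be boy", "This is a dog"]
vocab = build_frequency_vocab(x_text)

print(vocab["This"])                       # 1, the most frequent word
print(encode("This is a cat", vocab, 4))   # [1, 3, 2, 6]
```

Here "This" appears three times and gets id 1, while "a" and "is" appear twice each and get ids 2 and 3.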

  def transform(self, raw_documents):
    """Transform documents to word-id matrix.
    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.
    Args:
      raw_documents: An iterable which yield either str or unicode.
    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
      word_ids = np.zeros(self.max_document_length, np.int64)
      for idx, token in enumerate(tokens):
        if idx >= self.max_document_length:
          break
        word_ids[idx] = self.vocabulary_.get(token)
      yield word_ids