
Creating sequence vectors from text in Python


I am currently trying to prepare input data for an LSTM-based NN. I have a fairly large number of text documents, and what I want is to generate a sequence vector for each document so that I can feed them to an LSTM RNN as training data.

My crude approach:

import re
import numpy as np
#raw data
train_docs = ['this is text number one', 'another text that i have']

#put all docs together
train_data = ''
for val in train_docs:
    train_data += ' ' + val

# unique tokens over the whole corpus
tokens = np.unique(re.findall('[a-zа-я0-9]+', train_data.lower()))
# vocabulary: token -> integer index
voc = {v: k for k, v in dict(enumerate(tokens)).items()}
Then I brute-force replace the words of each document using the voc dict, as sketched below.
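Spelled out, that brute-force step might look roughly like this (a minimal sketch continuing from the code above; the indices in the comment assume np.unique's alphabetical ordering of this toy data):

# brute-force replacement: map every word of every document to its voc index
train_sequences = [
    [voc[w] for w in re.findall('[a-zа-я0-9]+', doc.lower())]
    for doc in train_docs
]
print(train_sequences)
# [[8, 3, 6, 4, 5], [0, 6, 7, 2, 1]]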


Is there any library that can help with this task?

You can use NLTK to tokenise the training documents. NLTK provides a standard word tokeniser and also lets you define your own tokeniser, e.g. RegexpTokenizer. See the NLTK documentation for more details on the different tokeniser functions available.
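For instance, a RegexpTokenizer with a pattern similar to the one in the question might look like this (a minimal sketch of the API, not part of the original answer; the pattern is just an example):

from nltk.tokenize import RegexpTokenizer

# keep runs of latin/cyrillic letters and digits, like the regex in the question
tokenizer = RegexpTokenizer('[a-zа-я0-9]+')
print(tokenizer.tokenize('this is text number one'))
# ['this', 'is', 'text', 'number', 'one']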

It may also help to preprocess the text.

Here is a quick demo using NLTK's pre-trained word tokeniser:

from nltk import word_tokenize

train_docs = ['this is text number one', 'another text that i have']
train_docs = ' '.join(map(str, train_docs))

# tokenise the joined text and build a token -> index vocabulary
tokens = word_tokenize(train_docs)
voc = {v: k for k, v in dict(enumerate(tokens)).items()}
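
To get one index sequence per document (which is what the LSTM ultimately needs), the same idea can be applied document by document; a minimal sketch building on the demo above (my own illustration, kept self-contained because the demo overwrites train_docs with the joined string):

from nltk import word_tokenize

train_docs = ['this is text number one', 'another text that i have']

# vocabulary over all documents: token -> integer index
all_tokens = word_tokenize(' '.join(train_docs))
voc = {v: k for k, v in dict(enumerate(all_tokens)).items()}

# one index sequence per document
X_train = [[voc[tok] for tok in word_tokenize(doc)] for doc in train_docs]
print(X_train)
# [[0, 1, 6, 3, 4], [5, 6, 7, 8, 9]]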

Solved it using the Keras text preprocessing classes:

Like this:

from keras.preprocessing.text import Tokenizer, text_to_word_sequence

train_docs = ['this is text number one', 'another text that i have']
tknzr = Tokenizer(lower=True, split=" ")
tknzr.fit_on_texts(train_docs)
#vocabulary:
print(tknzr.word_index)

Out[1]:
{'this': 2, 'is': 3, 'one': 4, 'another': 9, 'i': 5, 'that': 6, 'text': 1, 'number': 8, 'have': 7}

#making sequences:
X_train = tknzr.texts_to_sequences(train_docs)
print(X_train)

Out[2]:
[[2, 3, 1, 8, 4], [9, 1, 6, 5, 7]]
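
If the LSTM expects fixed-length input, these variable-length sequences can additionally be padded, for example with Keras' pad_sequences (a minimal sketch; maxlen=7 is just an example value, not part of the original answer):

from keras.preprocessing.sequence import pad_sequences

# pad (or truncate) every sequence to the same length, pre-padding with zeros
X_train_padded = pad_sequences(X_train, maxlen=7)
print(X_train_padded)
# [[0 0 2 3 1 8 4]
#  [0 0 9 1 6 5 7]]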
