How can I make my text-parsing function in Python more efficient/friendly?

python deep-learning lstm corpus

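I am trying to preprocess a large corpus (about 2 MB) so that every word in the text is grouped with the two words that follow it (i.e. in groups of three words). So for the input "The man ate the apple" I would get (the, man, ate), (man, ate, apple). I then want to vectorize each word, build a dataset (where the first two words are the input and the third word is the output), and feed it to an LSTM.

When I run the code below on a Google Compute Engine instance, the process is always killed once I increase the maximum number of words the (Keras) Tokenizer accepts. Any ideas on how to make my code more efficient?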

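import numpy as np
# Keras 2.x import path; the Tokenizer was deprecated in later releases
from keras.preprocessing.text import Tokenizer

size_of_vocabulary = 1000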

def preprocess_corpus():

    # load_corpus() and filename are not shown in the question
    text = load_corpus(filename)
    print("Preprocessing...")

    tokenizer = Tokenizer(num_words=size_of_vocabulary)
    tokenizer.fit_on_texts([text])

    word_index = tokenizer.word_index
    reverse_word_index = dict(zip(word_index.values(), word_index.keys()))  

    return text, word_index, reverse_word_index

def trie_data():

    def clean_text(text):
        filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
        translate_map = str.maketrans(filters, " " * len(filters))
        return text.translate(translate_map)

    def vectorize_word(word):
        # One-hot vector for the target word (allocated directly as float32)
        word_vector = np.zeros(size_of_vocabulary - 1, dtype='float32')
        word_vector[word_index[word]] = 1.0
        return word_vector

    text, word_index, reverse_word_index = preprocess_corpus()
    # Lowercase to match the Tokenizer's default (lower=True), and use a new
    # name instead of rebinding the clean_text() helper.
    words = clean_text(text).lower().split()

    X_data = list()
    Y_data = list()

    # Use generator (useful for large texts)
    def enumerate_data():
        for index in range(len(words) - 2):
            # Keep only triples whose target word is inside the vocabulary
            if word_index[words[index+2]] < size_of_vocabulary - 1:
                yield (np.asarray([word_index[words[index]], word_index[words[index+1]]]),
                       vectorize_word(words[index+2]))

    # Every target is a dense one-hot float32 vector, so Y_data grows with
    # (number of triples) x (size_of_vocabulary - 1).
    for x, y in enumerate_data():
        X_data.append(x)
        Y_data.append(y)

    return np.asarray(X_data), np.asarray(Y_data), word_index

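Why is it being killed?

The minimal, complete, verifiable example (MCVE) guideline applies here. We cannot effectively help you until you post MCVE code and accurately describe the problem. We should be able to paste your posted code into a text file and reproduce the problem you described.

One possible approach is to convert the words to integers (dictionary indices) right away. This gives you triples (1, 2, 3), (2, 3, 1), (3, 1, 4), and so on, replacing the strings with fixed-length integers.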
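
The integer-index suggestion above could look roughly like the sketch below. It is not the asker's code: `build_word_index` and `make_triples` are hypothetical helper names, the vocabulary is built with `collections.Counter` rather than the Keras Tokenizer, and index 0 is assumed to stand for out-of-vocabulary words. Targets are kept as plain integers rather than one-hot vectors.

```python
from collections import Counter

import numpy as np


def build_word_index(words, max_words):
    """Map the max_words most frequent words to 1..max_words.

    Index 0 is reserved (an assumption of this sketch) for every other word.
    """
    counts = Counter(words)
    most_common = counts.most_common(max_words)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}


def make_triples(words, word_index):
    """Return (X, y): X holds the first two word ids, y the third."""
    # Convert the whole corpus to integers once, up front.
    ids = np.array([word_index.get(w, 0) for w in words], dtype=np.int32)
    # Three shifted views of the id array form the sliding window of length 3.
    triples = np.stack([ids[:-2], ids[1:-1], ids[2:]], axis=1)
    # Drop triples whose target word is out of vocabulary.
    triples = triples[triples[:, 2] != 0]
    return triples[:, :2], triples[:, 2]


words = "the man ate the apple".split()
word_index = build_word_index(words, max_words=1000)
X, y = make_triples(words, word_index)
print(X.tolist(), y.tolist())
```

With integer targets, each sample costs 4 bytes as `int32`, versus 4 × (size_of_vocabulary − 1) bytes as a one-hot `float32` row, so the target array no longer grows with the vocabulary size. If the model is compiled with a loss such as Keras's `sparse_categorical_crossentropy`, integer targets like `y` can be used directly instead of one-hot vectors.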