Python: running a Keras tokenizer in a loop


I have multiple files with different structures that I want to tokenize.

For example, file 1:

event_name, event_location, event_description, event_priority

File 2:

event_name, event_participants, event_location, event_description, event_priority

And so on. I want to build one array from the data in all files and then tokenize it. Unfortunately, when I run tokenizer.fit_on_texts() in the loop, the dictionary is not extended but overwritten. I have to use the tokenizer inside the loop because I need to pad the event descriptions.

My code:

    import numpy as np
    import pandas as pd
    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=50000, oov_token="<OOV>")
    for file in files:
        print("Loading : ", file)
        events= pd.read_csv(file)
        # prepare columns
        events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
        events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
        events['event_description'] = 'XXED ' + events['event_description'].astype(str)
        events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
        # Tokenize concatenated columns into one
        tokenizer.fit_on_texts(np.concatenate((events['event_name'],events['event_location'], events['event_description'], events['event_priority']), axis=0))
        # Later I run texts_to_sequences on each column so later i am able to run pad_sequences on it and again I concatenate them
And fitting the tokenizer on another text:

text2="new sentence with unknown chars xxasqeew"
tokenizer.fit_on_texts(text2) 
tokenizer.word_index
{'<OOV>': 1, 'e': 2, 't': 3, 'n': 4, 's': 5, 'w': 6, 'o': 7, 'x': 8, 'r': 9, 'c': 10, 'h': 11, 'a': 12, 'm': 13, 'f': 14, 'i': 15, 'u': 16, 'k': 17, 'q': 18}

The indices in the tokenizer have completely changed.
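
As a side note (my own observation, not part of the original post): fit_on_texts expects a list (or other iterable) of texts. Passing a bare string makes Keras iterate over it character by character, which is why the word index above contains single characters. A minimal sketch of the difference (exact indices may vary slightly across Keras versions):

from tensorflow.keras.preprocessing.text import Tokenizer

t = Tokenizer(oov_token="<OOV>")
t.fit_on_texts("new sentence")        # a bare string: every character is treated as a text
print(t.word_index)                   # {'<OOV>': 1, 'e': 2, 'n': 3, 'w': 4, ...}

t = Tokenizer(oov_token="<OOV>")
t.fit_on_texts(["new sentence"])      # a list of strings: whole words are indexed
print(t.word_index)                   # {'<OOV>': 1, 'new': 2, 'sentence': 3}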

Just store the events and then tokenize them all at once:

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer():
    return Tokenizer(num_words=50000, oov_token="<OOV>")

all_events = []
files_to_tokens_dict = {}
for file in files:
    print("Loading : ", file)
    events= pd.read_csv(file)
    # prepare columns
    events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
    events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
    # collect the prepared columns so they can all be tokenized together later
    all_events.append(events['event_name'])
    all_events.append(events['event_location'])
    all_events.append(events['event_description'])
    all_events.append(events['event_priority'])
    # a per-file tokenizer, used only to record which tokens occur in this file
    tokenizer = create_tokenizer()
    tokenizer.fit_on_texts(np.concatenate((events['event_name'], events['event_location'], events['event_description'], events['event_priority']), axis=0))
    tokens_in_current_file = tokenizer.word_index.keys()
    files_to_tokens_dict[file] = tokens_in_current_file

global_tokenizer = create_tokenizer()
global_tokenizer.fit_on_texts(all_events)
global_tokenizer.word_index # one word index with all tokens
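
To handle the padding the question asks about, one option (my own sketch, not part of the original answer) is a second pass over the files once global_tokenizer has been fitted. Only the event_description column is shown, and maxlen=50 is an arbitrary example value:

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_descriptions = {}
for file in files:
    events = pd.read_csv(file)
    # repeat the same column preparation as above (only event_description shown here)
    descriptions = 'XXED ' + events['event_description'].astype(str)
    sequences = global_tokenizer.texts_to_sequences(descriptions)
    padded_descriptions[file] = pad_sequences(sequences, maxlen=50, padding='post')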

The dict is not overwritten, it is updated. The order of the words changes after each iteration because fit_on_texts sorts the word index by how often each word occurs (the most frequent word gets index 1, the second most frequent gets index 2, and so on; index 0 is reserved).

For example:

from tensorflow.keras.preprocessing.text import Tokenizer


tokenizer = Tokenizer()

text1 = ["aaa bbb ccc"]
tokenizer.fit_on_texts(text1)
print("1. iteration", tokenizer.word_index)

text2 = ["bbb ccc ddd"]
tokenizer.fit_on_texts(text2)
print("2. iteration", tokenizer.word_index)

text3 = ["ccc ddd eee"]
tokenizer.fit_on_texts(text3)
print("3. iteration", tokenizer.word_index)

# "ccc" occurs three times    
# "bbb" occurs twice
# "ddd" occurs twice
# "aaa" occurs once
# "eee" occurs once

# The actual output:
# 1. iteration {'aaa': 1, 'bbb': 2, 'ccc': 3}
# 2. iteration {'bbb': 1, 'ccc': 2, 'aaa': 3, 'ddd': 4}
# 3. iteration {'ccc': 1, 'bbb': 2, 'ddd': 3, 'aaa': 4, 'eee': 5}

Loop over all the files once to collect all the unique values, then feed all of those unique values into the tokenizer.

Yes, it does exist, but the question was edited.

# map a file to the global indices of the tokens that occur in that file
def get_token_indices(file):
    tokens_in_file = files_to_tokens_dict[file]
    result = []
    for token in tokens_in_file:
        global_token_index = global_tokenizer.word_index[token]
        result.append(global_token_index)
    return result
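
A hypothetical usage of this helper, assuming files is the same list that was iterated over above:

# for each file, collect the global indices of the tokens that occur in it
file_token_indices = {file: get_token_indices(file) for file in files}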