Python: running the Keras Tokenizer in a loop
I have multiple files with different structures that I would like to tokenize. For example, file 1:
event_name, event_location, event_description, event_priority
File 2:
event_name, event_participants, event_location, event_description, event_priority
And so on. I want to build one array with the data from all the files and then tokenize it. Unfortunately, when I run tokenizer.fit_on_texts() in the loop, the dictionary is not extended but overwritten. I have to use the tokenizer inside the loop because I need to pad the event descriptions.
My code:
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd

tokenizer = Tokenizer(num_words=50000, oov_token="<OOV>")

for file in files:
    print("Loading : ", file)
    events = pd.read_csv(file)
    # prepare columns
    events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
    events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
    # Tokenize the concatenated columns as one batch of texts
    tokenizer.fit_on_texts(np.concatenate((events['event_name'], events['event_location'], events['event_description'], events['event_priority']), axis=0))
    # Later I run texts_to_sequences on each column so I can run pad_sequences on it, and then I concatenate them again
And when I fit another text:
text2="new sentence with unknown chars xxasqeew"
tokenizer.fit_on_texts(text2)
tokenizer.word_index
{'<OOV>': 1, 'e': 2, 't': 3, 'n': 4, 's': 5, 'w': 6, 'o': 7, 'x': 8, 'r': 9, 'c': 10, 'h': 11, 'a': 12, 'm': 13, 'f': 14, 'i': 15, 'u': 16, 'k': 17, 'q': 18}
The indices in the tokenizer have changed completely.

Just store the events first and then tokenize all of them at once:
def create_tokenizer():
    return Tokenizer(num_words=50000, oov_token="<OOV>")

all_events = []
files_to_tokens_dict = {}

for file in files:
    print("Loading : ", file)
    events = pd.read_csv(file)
    # prepare columns
    events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
    events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
    # collect the prepared columns; extend keeps all_events a flat list of strings
    all_events.extend(events['event_name'])
    all_events.extend(events['event_location'])
    all_events.extend(events['event_description'])
    all_events.extend(events['event_priority'])
    # per-file tokenizer, only used to record which tokens occur in this file
    tokenizer = create_tokenizer()
    tokenizer.fit_on_texts(np.concatenate((events['event_name'], events['event_location'], events['event_description'], events['event_priority']), axis=0))
    tokens_in_current_file = list(tokenizer.word_index.keys())
    files_to_tokens_dict[file] = tokens_in_current_file

global_tokenizer = create_tokenizer()
global_tokenizer.fit_on_texts(all_events)
global_tokenizer.word_index  # one word index with all tokens
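Once the global tokenizer has been fitted on all_events, the per-file work the question mentions (texts_to_sequences followed by pad_sequences) can happen in a second pass over the files. The following is only a sketch of that second pass, reusing files, pd and global_tokenizer from above; max_description_len is a made-up value you would choose yourself:

from tensorflow.keras.preprocessing.sequence import pad_sequences

max_description_len = 50  # hypothetical length, pick something that fits your data

padded_descriptions = {}
for file in files:
    events = pd.read_csv(file)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    # indices now come from the single global vocabulary, so they are stable across files
    seqs = global_tokenizer.texts_to_sequences(events['event_description'])
    padded_descriptions[file] = pad_sequences(seqs, maxlen=max_description_len, padding='post')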
The dict is not overwritten; it is updated. The order of the words changes after each iteration because fit_on_texts sorts the word index by how often each word occurs (the most common word gets index 1, the second most common gets index 2, and so on; index 0 is reserved).
For example:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
text1 = ["aaa bbb ccc"]
tokenizer.fit_on_texts(text1)
print("1. iteration", tokenizer.word_index)
text2 = ["bbb ccc ddd"]
tokenizer.fit_on_texts(text2)
print("2. iteration", tokenizer.word_index)
text3 = ["ccc ddd eee"]
tokenizer.fit_on_texts(text3)
print("3. iteration", tokenizer.word_index)
# "ccc" occurs three times
# "bbb" occurs twice
# "ddd" occurs twice
# "aaa" occurs once
# "eee" occurs once
# The actual output:
# 1. iteration {'aaa': 1, 'bbb': 2, 'ccc': 3}
# 2. iteration {'bbb': 1, 'ccc': 2, 'aaa': 3, 'ddd': 4}
# 3. iteration {'ccc': 1, 'bbb': 2, 'ddd': 3, 'aaa': 4, 'eee': 5}
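To convince yourself that the dictionary is updated rather than replaced, you can also inspect tokenizer.word_counts, which keeps accumulating frequencies across fit_on_texts calls. The snippet below is a minimal check along those lines; it also illustrates why the second fit in the question produced single-character entries: fit_on_texts iterates over whatever you pass it, so a bare string is treated as a sequence of one-character texts, while a one-element list keeps word-level tokens.

from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(["aaa bbb ccc"])
tok.fit_on_texts(["bbb ccc ddd"])
# word_counts accumulates across calls instead of starting over
print(dict(tok.word_counts))  # {'aaa': 1, 'bbb': 2, 'ccc': 2, 'ddd': 1}

# A bare string is iterated character by character, hence the 'e', 't', 'n', ...
# entries seen in the question
char_tok = Tokenizer(oov_token="<OOV>")
char_tok.fit_on_texts("new sentence with unknown chars xxasqeew")
print(list(char_tok.word_index)[:6])  # '<OOV>' followed by single characters

# Wrapping the text in a list keeps whole words
word_tok = Tokenizer(oov_token="<OOV>")
word_tok.fit_on_texts(["new sentence with unknown chars xxasqeew"])
print(word_tok.word_index)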
Loop over all the files once just to get all the unique values, then put all of those unique values into the tokenizer. Yes, it does exist, but the question was edited.
def get_token_indices(file):
    # map the tokens seen in this file to their indices in the global tokenizer
    tokens_in_file = files_to_tokens_dict[file]
    result = []
    for token in tokens_in_file:
        global_token_index = global_tokenizer.word_index[token]
        result.append(global_token_index)
    return result
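A possible way to use this helper (purely illustrative, reusing files from above); the returned indices refer to global_tokenizer.word_index, so they stay comparable across files:

first_file = files[0]
indices = get_token_indices(first_file)
print(len(indices), "distinct tokens in", first_file)
print(indices[:10])  # positions in the global vocabulary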