Python: tokenize concatenated characters based on a given dictionary
I want to tokenize a concatenated string based on a given dictionary and output the tokenized words that are found. For example, I have the following:
dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'
The output should look like this:
[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
I want a list of tuples as the output. The first element of each tuple is the word found in the dictionary, the second is the character offsets, and the third is the length of the found word. For characters that cannot be matched, we glue them together into one word, like padthai above. If more than one dictionary word matches, we pick the longest one, choosing yakkin over yak and kin.
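The longest-match rule can be checked with a quick sketch over the sample data above:

```python
dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'

# all dictionary words matching at position 0
matches = [w for w in dictionary if chars.startswith(w)]
print(matches)                # ['yak', 'yakkin']
print(max(matches, key=len))  # 'yakkin' -- the longest match wins
```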
Below is my current implementation. It starts at index 0, then loops over the characters. It doesn't work yet:
import numpy as np

def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars / 3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)],
                                   (start, start + r),
                                   len(chars[start:start + r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words

tokenize(chars, dictionary)  # gives only [('yakkin', (0, 6), 6)]
I'm having a hard time solving this. Please feel free to comment/suggest.

It might look a little ugly, but it works:
def tokenize(string, dictionary):
    # sorting dictionary words by length
    # because we need to find the longest word if possible,
    # like "yakkin" instead of "yak"
    sorted_dictionary = sorted(dictionary,
                               key=lambda word: len(word),
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens
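For reference, a condensed, self-contained sketch of the same idea (longest-first dictionary, prefix match, gluing unknown runs up to the next dictionary word) that reproduces the expected output from the question:

```python
def tokenize(string, dictionary):
    # longest words first, so 'yakkin' wins over 'yak'
    words = sorted(dictionary, key=len, reverse=True)
    start, tokens = 0, []
    while start < len(string):
        substring = string[start:]
        match = next((w for w in words if substring.startswith(w)), None)
        if match is not None:
            offset = len(match)
        else:
            # glue unknown characters up to the next dictionary word
            positions = [i for i in (substring.find(w) for w in words) if i > 0]
            offset = min(positions) if positions else len(substring)
        tokens.append((substring[:offset], (start, start + offset), offset))
        start += offset
    return tokens

print(tokenize('yakkinpadthaikhaikoo',
               ['yak', 'kin', 'yakkin', 'khai', 'koo']))
# [('yakkin', (0, 6), 6), ('padthai', (6, 13), 7),
#  ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
```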
You can use find to get the starting index of each word, and the word's length is known thanks to len. Iterate over each word in the dictionary, and your list will be complete.
def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0
        # skips words that appear in longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue
        while index < len(chars):
            index = chars.find(word, index)  # start search from index
            if index == -1:  # find() returns -1 if not found
                break
            # Append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len + index), word_len))
            index += word_len
    return tokens
Can you elaborate on how you get ('padthai', (6, 13), 7)?

Basically, I want to start at the 6th character and go until the end of the string. If I can't find any word matching my dictionary, I move on to the 7th character, and so on. When I reach the 13th character, I should find a new word while searching to the end, so I can chunk characters 6:13 together as one token.

This is almost the format I want, but in that case you also have to drop yak and kin, since we pick yakkin. Is there a way?

Sure, see the edit I made: I just added another loop that checks whether a word occurs inside a longer word. Just a heads-up, this method is case-sensitive, although you can use lower or upper to ignore case.
>>> tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])
skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
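As noted in the comments, the matching is case-sensitive; lowercasing both the string and the dictionary before calling the function sidesteps that. A self-contained sketch (the tokenize here mirrors the answer above with the skip check condensed; tokenize_ci is a hypothetical wrapper name):

```python
def tokenize(chars, word_list):
    """Find dictionary words in chars (case-sensitive), skipping
    words that are substrings of longer dictionary words."""
    tokens = []
    for word in word_list:
        if any(word in other and len(other) > len(word)
               for other in word_list):
            continue  # e.g. 'yak' is inside 'yakkin'
        index = 0
        while index < len(chars):
            index = chars.find(word, index)
            if index == -1:
                break
            tokens.append((word, (index, index + len(word)), len(word)))
            index += len(word)
    return tokens

def tokenize_ci(chars, word_list):
    # lowercase both sides to ignore case
    return tokenize(chars.lower(), [w.lower() for w in word_list])

print(tokenize_ci('YakKinPadThaiKhaiKoo',
                  ['Yak', 'Kin', 'Yakkin', 'Khai', 'Koo']))
# [('yakkin', (0, 6), 6), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
```

Note that, like the answer's version, this does not emit the unknown run padthai; it only reports dictionary matches.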