Python: tokenizing concatenated characters based on a given dictionary

I want to tokenize a string of concatenated characters based on a given dictionary and output the tokenized words that are found. For example, I have the following:

dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'
The output should look like this:

[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
I want a list of tuples as the output. The first element of each tuple is the word found in the dictionary, the second is the character offsets, and the third is the length of the found word. Characters that cannot be matched are merged into a single word, like padthai above. If more than one dictionary word matches, the longest one is chosen, i.e. yakkin rather than yak and kin.

Below is my current implementation. It starts at index 0 and then loops over the characters, but it does not work yet:

import numpy as np

def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars/3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)], (start, start + r), len(chars[start:start+r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words

tokenize(chars, dictionary) # gives only [('yakkin', (0, 6), 6)]

I am having a hard time solving this. Comments/suggestions are welcome.

It may look a bit ugly, but it works:

def tokenize(string, dictionary):
    # sorting dictionary words by length
    # because we need to find longest word if its possible
    # like "yakkin" instead of "yak"
    sorted_dictionary = sorted(dictionary,
                               key=lambda word: len(word),
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens
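
For reference, calling this function with the inputs from the question (the same dictionary and chars as above) should reproduce the desired output, including the merged 'padthai' chunk:

dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'

print(tokenize(chars, dictionary))
# [('yakkin', (0, 6), 6), ('padthai', (6, 13), 7),
#  ('khai', (13, 17), 4), ('koo', (17, 20), 3)]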

You can use find to get the starting index of each word, and the word's length is known thanks to len. Iterate over every word in the dictionary and your list will be complete:

def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0

        # skips words that appear in longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue

        while index < len(chars):
            index = chars.find(word, index) # start search from index
            if index == -1: # find() returns -1 if not found
                break
            # Append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len+index), word_len))
            index += word_len

    return tokens

Can you elaborate on how you got ('padthai', (6, 13), 7)?

Basically, I want to start at the 6th character and search until the end of the string. If I cannot find any word matching my dictionary, I move on to the 7th character, and so on. When I reach the 13th character, searching onwards should find a new word, so I can chunk characters 6:13 together.

This is almost the format I want, but in this case you also have to drop yak and kin, since we pick yakkin. Is there a way to do that?

Sure, see the edit I made. I just added another loop that checks whether a word appears inside a longer word. Just a note: this method is case-sensitive, although you can use lower or upper to ignore case:
>>> tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])

skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6), 
 ('khai', (13, 17), 4), 
 ('koo', (17, 20), 3)]
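
Note that this output is not guaranteed to be in offset order and still lacks the unmatched 'padthai' chunk that the question asks for. A minimal post-processing sketch could sort the tokens and fill the gaps with the leftover characters; the fill_gaps helper below is my own illustration, not part of the answer above:

def fill_gaps(tokens, chars):
    # order the dictionary matches by their start offset
    tokens = sorted(tokens, key=lambda token: token[1][0])
    result = []
    position = 0
    for word, (start, end), length in tokens:
        if start > position:
            # characters between two matches form one unmatched chunk
            gap = chars[position:start]
            result.append((gap, (position, start), len(gap)))
        result.append((word, (start, end), length))
        position = end
    if position < len(chars):
        # trailing characters after the last match
        gap = chars[position:]
        result.append((gap, (position, len(chars)), len(gap)))
    return result

fill_gaps(tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo']),
          'yakkinpadthaikhaikoo')
# [('yakkin', (0, 6), 6), ('padthai', (6, 13), 7),
#  ('khai', (13, 17), 4), ('koo', (17, 20), 3)]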