How to efficiently search for list elements in a string in Python


I have a list of concepts (myconcepts) and a list of sentences (sentences), as follows:

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]


sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']
In short, I want to find the concepts in the sentences. More specifically, given a list in concepts (e.g. ['natural language processing', 'text mining', 'texts', 'nlp']), I want to identify these concepts in the sentences and replace them with the list's first element (i.e. natural language processing).

Example: if we consider the sentence "data mining and text mining", the result should be "advanced data mining and natural language processing" (because the first elements of the lists containing data mining and text mining are advanced data mining and natural language processing, respectively).
The result for the dummy data above should be:

['advanced data mining and natural language processing', 'natural language processing is mainly used by discourse analysis community', 'advanced data mining in python is fun', 'discourse analysis advanced data mining analysis involves natural language processing', 'advanced data mining and advanced data mining are both very interesting']
I am currently doing this with regex, as follows:

import re

concepts_re = []

# build one alternation pattern per concept list
for terms in concepts:
    item_re = "|".join(re.escape(term) for term in terms)
    concepts_re.append(item_re)

sentences_mapping = []

for sentence in sentences:
    for terms in concepts:
        if len(terms) > 1:
            for item in terms:
                if item in sentence:
                    sentence = re.sub(concepts_re[concepts.index(terms)], terms[0], sentence)
    sentences_mapping.append(sentence)
In my real data set I have around 8 million concepts, so this approach is very inefficient: processing a single sentence takes 5 minutes. I would like to know whether there is an efficient way to do this in Python.

For those who would like to measure timings against a long list of concepts, I attach one here:

I am happy to provide more details if needed.

One approach: use suffix arrays.

Skip this step if your data is already cleaned.

First, clean your data by replacing all whitespace characters with any character that you know is not part of any concept or sentence.

Then build a suffix array for every sentence. This takes O(n log n) time per sentence; there are also a few algorithms that can do it in O(n) time.

Once the suffix arrays are ready for all the sentences, just perform a binary search for each of your concepts.

You can optimize the search further using an LCP (longest common prefix) array.

Using LCP and suffix arrays, the time complexity of the search can be brought down to O(n).

Edit:
This approach is commonly used in sequence alignment on genomes and is quite popular there as well. You should easily find an implementation that suits you.
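As a minimal, self-contained sketch of the idea above (using a naive O(n² log n) construction for brevity rather than one of the linear-time algorithms mentioned; `build_suffix_array` and `contains` are hypothetical helper names):

```python
def build_suffix_array(text: str):
    # Naive construction: sort suffix start positions by suffix text.
    # Linear-time algorithms (e.g. SA-IS) exist but are far more involved.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, suffix_array, pattern: str) -> bool:
    # Binary search for the first suffix >= pattern, then check whether
    # that suffix actually starts with the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text[suffix_array[lo]:].startswith(pattern)

sentence = "data mining and text mining"
sa = build_suffix_array(sentence)
print(contains(sentence, sa, "text mining"))  # True
print(contains(sentence, sa, "nlp"))          # False
```

Each query is O(m log n) here; the LCP-array refinement mentioned above shaves off repeated prefix comparisons.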

The solution provided below has approximately O(n) runtime complexity, where n is the number of tokens in each sentence.

For 5 million sentences and your concepts.txt it performs the required operations in ~30 seconds; see the basic test in the third section.

When it comes to space complexity, you have to keep a nested dictionary structure (let's simplify it like that for now); say it is O(c*u), where u is the number of unique tokens for concepts of a certain (token-wise) length, while c is the length of a concept.

It is hard to pinpoint the exact complexity, but it goes pretty similarly to this (for your example data and the concepts.txt you provided it is quite accurate, but we will get to the gory details as we go through the implementation).

I assume you can split your concepts and sentences on whitespace; if that is not the case, I would suggest taking a look at smarter ways to tokenize your data.
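For instance, a minimal tokenizer sketch using only the standard library (the `\w+` pattern and lowercasing are assumptions, not part of the solution below, which relies on plain whitespace splitting):

```python
import re

def tokenize(text: str):
    # Extract runs of word characters, dropping punctuation;
    # plain str.split() would keep "mining," as a single token.
    return re.findall(r"\w+", text.lower())

print(tokenize("Data mining, text-mining and NLP!"))
# ['data', 'mining', 'text', 'mining', 'and', 'nlp']
```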

1. Introduction

Let's take your example:

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
As you said, each element in concepts has to be mapped to the first one, so, in Pythonish, it would go roughly along these lines:

for concept in concepts:
    # every alias maps onto the first (target) element
    concept[1:] = [concept[0]] * (len(concept) - 1)
This task would be easy if all the concepts had a token length equal to one (which is not the case here) and were unique. Let's focus on the second case and one particular (slightly modified) example of a concept to see my point:

["advanced data mining", "data something", "data"]
Here, data would be mapped to advanced data mining, BUT data something, which contains data, should be mapped before it. If I understand you correctly, you would want this sentence:

"Here is data something and another data"
to be mapped onto:

"Here is advanced data mapping and another advanced data mining"
And not the naive approach:

"Here is advanced data mapping something and another advanced data mining"
See that in the second (naive) example we only mapped data, not data something.

To prioritize the data, I used an array structure filled with dictionaries, where concepts that are longer token-wise come earlier in the array.

Continuing our example, such an array would look like this:

structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]
Notice that if we go through the tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match was found, going to the second dictionary, and so on), we will get the longest concepts first.
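As a quick sketch of how this ordering plays out (hypothetical lookup code, not the actual traversal functions defined below):

```python
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

tokens = ["data", "something"]

# Try the longest-concept dictionary first: "data" leads to a nested
# dictionary, and "something" then leads to the target concept.
node = structure[0].get(tokens[0])
result = node.get(tokens[1]) if isinstance(node, dict) else None
print(result)  # advanced data mining
```

Only if this lookup failed would we fall back to the second dictionary and map the bare "data".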

2. Code

OK, I hope you get the basic idea (if not, post a comment below and I will try to explain the unclear parts in more detail).

Disclaimer: I am not particularly proud of this code, but it gets the job done and could, I suppose, be worse.

2.1 Hierarchical dictionary

First, let's get the longest concept token-wise (excluding the first element, as it is our target and we never have to change it):
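This is the get_longest helper (it also appears in the full listing further below):

```python
from typing import List

def get_longest(concepts: List[List[str]]):
    # Longest concept measured in whitespace-separated tokens,
    # skipping the first (target) element of each concept list.
    return max(len(text.split()) for concept in concepts for text in concept[1:])
```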

Using this information, we can initialize our structure by creating as many dictionaries as there are different concept lengths (in the example above it is 2, so it would fit all your data; concepts of any length would do though):

Notice that I am adding the length of each concept to the array; in my opinion it is easier that way when it comes to traversing, though you could go without it after some changes to the implementation.
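This is the init_hierarchical_dictionaries helper (it also appears in the full listing further below):

```python
def init_hierarchical_dictionaries(longest: int):
    # One (length, dict) tuple per possible concept length, ordered from
    # longest to shortest; `length` is the nesting depth of the dictionary,
    # i.e. the number of tokens minus one.
    return [(length, {}) for length in reversed(range(longest))]
```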

Now, when we have those helper functions, we can create the structure from the list of concepts:

def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault keeps an existing nested dictionary instead of
                # overwriting it when two concepts share a token prefix
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries
This function will create our hierarchical dictionaries; see the comments in the source code for some explanation. You may want to create a custom class wrapping this structure, which should make it easier to use.
2.2 Traversing sentences

With the structure in place, embedding all the sentences is a lazy mapping over them:
def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)
traverse does the per-sentence work: it walks the sentence token by token and, for each position, tries the hierarchical dictionaries from the longest concepts to the shortest:
def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its following tokens) does not match any concept; keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept
3. Tests

The whole code, together with a sanity check and a basic speed measurement:
import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault keeps an existing nested dictionary instead of
                # overwriting it when two concepts share a token prefix
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its following tokens) does not match any concept; keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)

    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    sentences = initial_sentences.copy()

    for i in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()
    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()
Timing results:
Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms