C#: find all keywords and their indexes in a text with exact matching


c# regex algorithm


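I have a list of keywords and a text to search. I need the start index of every keyword occurrence found in the text, and the match must be exact. For example:

keywords => cat, dog
text => a catchy cat with a dogged dog

Here only 'cat' and 'dog' should match, and the matches must be returned with their indexes. Words such as 'catchy' and 'dogged' must not count as matches.

I tried, but my attempt also matches 'catchy' and 'dogged'. How can I match the keywords exactly and return their index positions in the text using C#?

Use a regular expression with word boundaries (\b):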

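// Requires: using System.Linq; using System.Text.RegularExpressions;
// \b anchors each keyword at word boundaries, so "cat" does not match inside "catchy".
// Regex.Escape keeps keywords that contain regex metacharacters (e.g. "c++") literal.
var results = keywords.Select(x =>
    new
    {
        word = x,
        indexes = Regex.Matches(input, @"\b" + Regex.Escape(x) + @"\b")
                       .Cast<Match>()
                       .Select(y => y.Index)
                       .ToList()
    });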
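
The query above scans the text once per keyword. As a rough sketch of a single-pass variant, the keywords can be joined into one alternation inside the same `\b` anchors. The class name `KeywordFinder`, the method `FindWholeWords`, and the dictionary return type are hypothetical and not part of the original answer. The sketch assumes case-sensitive matching and keywords that begin and end with a word character, since `\b` only behaves as a word boundary in that case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class KeywordFinder
{
    // Maps each keyword to the start indexes of its whole-word occurrences in input.
    public static Dictionary<string, List<int>> FindWholeWords(string input, IEnumerable<string> keywords)
    {
        List<string> distinct = keywords.Where(k => !string.IsNullOrEmpty(k)).Distinct().ToList();
        Dictionary<string, List<int>> result = distinct.ToDictionary(k => k, k => new List<int>());
        if (distinct.Count == 0)
        {
            return result;
        }

        // Longer keywords go first so that, at a given position, the longest alternative is tried first.
        string alternation = string.Join("|", distinct.OrderByDescending(k => k.Length).Select(Regex.Escape));
        Regex pattern = new Regex(@"\b(?:" + alternation + @")\b");

        foreach (Match m in pattern.Matches(input))
        {
            // Without RegexOptions.IgnoreCase the matched text is exactly one of the keywords.
            result[m.Value].Add(m.Index);
        }
        return result;
    }
}
```

For the example in the question, `KeywordFinder.FindWholeWords("a catchy cat with a dogged dog", new[] { "cat", "dog" })` yields index 9 for "cat" and index 27 for "dog". One trade-off of the single pass: regex matches do not overlap, so a keyword that occurs inside another matched keyword is not reported separately, whereas the per-keyword query reports both.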


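You can use the Aho-Corasick algorithm with a small modification: for every keyword, append a word delimiter (such as a space, a period, or a newline) to the end of the keyword.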


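So if you have m keywords and n kinds of delimiter, you build the trie from n*m words.

With the delimiters appended, the keywords no longer match 'catchy' and 'dogged' in your example.

Edit:

First, you should understand the AC algorithm.

For example:

keywords => cat, dog and text => a catchy cat with a dogged dog

The changed keywords are now => 'cat ', 'dog ', 'cat\n', 'dog\n' (just adding the space and newline delimiters)

The changed text => "a catchy cat with a dogged dog\n"

Then you can use the standard Aho-Corasick string-matching algorithm to find every index of each keyword.

Assuming the text has length n and the keywords have total length m, the Aho-Corasick algorithm runs in O(n + m) time, plus the number of matches reported, which is enough to handle large texts and large keyword sets.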
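
The steps above could be sketched roughly as follows. This is an illustrative implementation under stated assumptions, not code from the answer: the class `AhoCorasickSketch`, the method `Find`, and the default delimiter set `" .\n"` are all hypothetical. One addition goes beyond the answer: appending a delimiter only protects the right-hand side of a keyword, so the sketch also rejects a hit when the character before it is a letter or digit (otherwise "cat " would match inside "tomcat ").

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: Aho-Corasick over "keyword + delimiter" patterns.
public static class AhoCorasickSketch
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        // Every pattern that ends at this node, with the pattern length including the delimiter.
        public readonly List<(string Keyword, int Length)> Outputs = new List<(string Keyword, int Length)>();
    }

    public static Dictionary<string, List<int>> Find(string text, IEnumerable<string> keywords, string delimiters = " .\n")
    {
        Node root = new Node();
        Dictionary<string, List<int>> result = new Dictionary<string, List<int>>();

        // 1. Build the trie from every keyword combined with every delimiter (n * m patterns).
        foreach (string keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword) || result.ContainsKey(keyword)) continue;
            result[keyword] = new List<int>();
            foreach (char delimiter in delimiters)
            {
                Node node = root;
                foreach (char ch in keyword + delimiter)
                {
                    if (!node.Next.TryGetValue(ch, out Node existing))
                    {
                        existing = new Node();
                        node.Next[ch] = existing;
                    }
                    node = existing;
                }
                node.Outputs.Add((keyword, keyword.Length + 1));
            }
        }

        // 2. Breadth-first pass that sets failure links and merges outputs along them.
        Queue<Node> queue = new Queue<Node>();
        foreach (Node firstLevel in root.Next.Values)
        {
            firstLevel.Fail = root;
            queue.Enqueue(firstLevel);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in current.Next)
            {
                Node fallback = current.Fail;
                while (fallback != root && !fallback.Next.ContainsKey(edge.Key)) fallback = fallback.Fail;
                edge.Value.Fail = fallback.Next.TryGetValue(edge.Key, out Node target) ? target : root;
                edge.Value.Outputs.AddRange(edge.Value.Fail.Outputs);
                queue.Enqueue(edge.Value);
            }
        }

        // 3. Scan the text once, with a trailing newline so a keyword at the very end can match.
        string padded = text + "\n";
        Node state = root;
        for (int i = 0; i < padded.Length; i++)
        {
            char c = padded[i];
            while (state != root && !state.Next.ContainsKey(c)) state = state.Fail;
            if (state.Next.TryGetValue(c, out Node next)) state = next;
            foreach ((string Keyword, int Length) hit in state.Outputs)
            {
                int start = i - hit.Length + 1;
                // Left-boundary check (an addition to the answer's idea, see the lead-in).
                if (start == 0 || !char.IsLetterOrDigit(padded[start - 1]))
                {
                    result[hit.Keyword].Add(start);
                }
            }
        }
        return result;
    }
}
```

Called as `AhoCorasickSketch.Find("a catchy cat with a dogged dog", new[] { "cat", "dog" })`, this reports index 9 for "cat" and index 27 for "dog". The delimiter set is the main design choice: any character that may legitimately follow a keyword (commas, quotes, tabs, and so on) has to be listed, and each extra delimiter adds one more pattern per keyword to the trie.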


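// Returns the start index of every whole-word occurrence of key in content.
// Requires: using System; using System.Collections.Generic;
private List<int> GetIndexForKeyWord(string content, string key)
{
    List<int> indexes = new List<int>();
    if (string.IsNullOrEmpty(key))
    {
        return indexes;
    }

    int index = content.IndexOf(key, StringComparison.Ordinal);
    while (index >= 0)
    {
        int end = index + key.Length;
        // Accept the hit only if no letter touches it on either side,
        // so "cat" is rejected inside both "catchy" and "tomcat".
        bool leftOk = index == 0 || !char.IsLetter(content[index - 1]);
        bool rightOk = end == content.Length || !char.IsLetter(content[end]);
        if (leftOk && rightOk)
        {
            indexes.Add(index);
        }
        // Resume one character later so no occurrence is skipped.
        index = content.IndexOf(key, index + 1, StringComparison.Ordinal);
    }
    return indexes;
}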