Efficient substring in C# with multiple inputs (C#, .NET, String, .NET 2.0)

c# .net string

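Assuming I don't want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code-golf code), can I do better than string.Contains at handling a collection of input strings and a collection of keywords to check for?

Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms that do better than this in special circumstances, particularly when working with multiple strings. But adding such an algorithm to my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built-in function.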

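For example, the input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. The output would be a subset of the first collection, all of whose members contain at least 1 positive keyword but 0 negative keywords.

Oh, and please don't suggest regular expressions as a solution.

It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.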

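Edit:

A lot of people are only offering trivial improvements that won't beat an intelligently used call to Contains by much, if at all. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here is an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that is probably asymptotically better than standard Contains:

Suppose you have 10,000 positive keywords of 5-15 characters each, 0 negative keywords (this seems to confuse people), and 1,000,000 strings. Check whether each of the 1,000,000 strings contains at least 1 of the positive keywords.

I suppose one solution is to build an FSA (finite-state automaton). Another is to delimit on spaces and use hashes.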
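To make the FSA idea concrete: with the numbers above, a naive loop makes up to 10,000 × 1,000,000 = 10^10 Contains calls. One well-known automaton for this job is Aho-Corasick, which scans each string once regardless of how many keywords there are. The sketch below is only an illustration, not code from the question or its answers; the class name KeywordAutomaton and its members are hypothetical, matching is ordinal and case-sensitive, and it clearly exceeds the "dozen lines" budget the question asks for.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original post): an Aho-Corasick automaton
// that answers "does this text contain any of the keywords?" in a single pass.
public sealed class KeywordAutomaton
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // node for the longest proper suffix that is also a trie path
        public bool IsMatch;  // a keyword ends here or somewhere along the failure chain
    }

    private readonly Node root = new Node();

    public KeywordAutomaton(IEnumerable<string> keywords)
    {
        // Step 1: build a trie of all keywords.
        foreach (string keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue; // empty keywords are ignored
            Node node = root;
            foreach (char c in keyword)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next.Add(c, child);
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new Queue<Node>();
        root.Fail = root;
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fallback = node.Fail;
                Node target;
                while (!fallback.Next.TryGetValue(edge.Key, out target) && fallback != root)
                    fallback = fallback.Fail;
                edge.Value.Fail = target ?? root;
                // Inherit matches that end at a suffix of this path.
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans the text once; returns true as soon as any keyword is seen.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char c in text)
        {
            Node next;
            while (!node.Next.TryGetValue(c, out next) && node != root)
                node = node.Fail;
            node = next ?? root;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

The automaton would be built once from the positive keywords and then reused for every input string, for example `new KeywordAutomaton(keywords).ContainsAny(line)`. Whether it actually beats repeated Contains calls on real data is something to measure rather than assume.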

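Well, there is the Split() method you can call on a string. You could use Split() to break the input strings into arrays of words, then check the words against the keywords one by one. However, I have no idea whether, or under what circumstances, this would be faster than using Contains().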
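One thing worth noting about the Split() approach is that it changes what counts as a match: it finds keywords only as whole words, whereas Contains() also finds them inside longer words. A minimal sketch of the split-and-compare idea follows; the class and method names are hypothetical, and whitespace is assumed to be the only delimiter.

```csharp
using System;

public static class WordSplitCheck
{
    // Hypothetical helper: true if any whitespace-delimited word equals a keyword.
    public static bool AnyWordMatches(string input, string[] keywords)
    {
        // A null separator array makes Split() break on whitespace.
        string[] words = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
            foreach (string keyword in keywords)
                if (word == keyword)
                    return true;
        return false;
    }
}
```

For the input "hashing works", `"hashing works".Contains("hash")` is true, but `AnyWordMatches("hashing works", new[] { "hash" })` is false because no whole word equals "hash". The nested loops are still one comparison per word per keyword, so on their own they are not an asymptotic improvement.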

If you add this extension method:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

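This is not necessarily any faster than doing the Contains checks yourself, but it performs them efficiently: ContainsAny returns at the first match, and LINQ streams its results, so no unnecessary Contains calls are made. The resulting code is also a one-liner, which is nice.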

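Your discussion of "negative and positive" keywords is somewhat confusing and may need some clarification to get more complete answers.

As with all performance-related questions, you should first write the simple version and then profile it to find out where the bottlenecks are; these can be unintuitive and hard to predict. Having said that...

One way to optimize the search (if you are always searching for "words", and not phrases that could contain spaces) would be to build a search index from your strings.

The search index could be either a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster, both because dictionaries are hash maps internally with O(1) lookup, and because a dictionary naturally eliminates duplicate values in the search source, thereby reducing the number of comparisons you need to perform.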
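As a rough illustration of the two index choices just described, the sketch below builds both kinds of index from one string. The class and method names are hypothetical, tokens are split on single spaces, and ordinal comparison is an assumption made here so that sorting and searching agree.

```csharp
using System;
using System.Collections.Generic;

public static class TokenIndexDemo
{
    // Sorted-array index: O(n log n) to build, O(log n) per lookup.
    // Duplicate words stay in the array.
    public static string[] BuildSortedIndex(string text)
    {
        string[] tokens = text.Split(' ');
        Array.Sort(tokens, StringComparer.Ordinal);
        return tokens;
    }

    public static bool InSortedIndex(string[] index, string word)
    {
        // Must use the same comparer that was used for sorting.
        return Array.BinarySearch(index, word, StringComparer.Ordinal) >= 0;
    }

    // Dictionary index: expected O(n) to build, expected O(1) per lookup.
    // Duplicate words collapse into one key; the value counts occurrences.
    public static Dictionary<string, int> BuildDictionaryIndex(string text)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string token in text.Split(' '))
        {
            int seen;
            counts.TryGetValue(token, out seen); // seen stays 0 when the key is absent
            counts[token] = seen + 1;
        }
        return counts;
    }
}
```

For "to be or not to be" the sorted array keeps all six tokens, while the dictionary holds four keys. Note that either index is rebuilt for every input string, so the build cost has to be repaid by the number of keyword lookups per string.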

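The general search algorithm is:

For each string you are searching against:

Tokenize the string being searched into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or a dictionary)
Search the index for your "negative keywords"; if one is found, skip to the next search string
Search the index for your "positive keywords"; when one is found, add it to a dictionary of found keywords (you could also track a count of how often the word appears)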

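Here is an example using a sorted array and binary search in C# 2.0 (the FindKeyWordOccurence listing below):

NOTE: You could switch from string[] to List<string> easily enough; I leave that to you.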

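Here is a version that uses a dictionary-based index and LINQ in C# 3.0 (the FindOccurences listing below):

NOTE: This is not the most LINQ-y way to do it. I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement, but I find this easier to understand.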

// One-line filter built on the ContainsAny extension method:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection))
                         .Where(t => t.ContainsAny(goodKeywordCollection));

// C# 2.0 version: sorted array plus binary search
string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string...
       // (a flag is needed: a continue inside the inner foreach would only
       // advance the keyword loop, not the outer loop)
       bool hasNegative = false;
       foreach( string negKeyword in negativeKeywords )
       {
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
           {
               hasNegative = true;
               break;
           }
       }
       if( hasNegative )
           continue; // skip to next search string...

       // for each positive keyword, add to dictionary to keep track of it
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeywords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1;
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
   return foundKeywordsArray;
}

// C# 3.0 version: dictionary-based index plus LINQ (needs using System.Linq)
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string...
        // Distinct() is needed because ToDictionary() throws on duplicate keys
        var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where( tokenizedDictionary.ContainsKey );
        foreach( var keyword in found )
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}

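// Bucket, MatchString, BuildBuckets and ComputeHash are helper types and
// methods that are not shown here. Each bucket groups the match strings of
// one length; buckets[0] is taken to hold the shortest ones.
IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
// <= so that a match ending exactly at the end of the input is not missed
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        // skip only when the window would run past the end of the input
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            // cheap hash comparison first, full string comparison only on a hit
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}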
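The listing above relies on helpers that are not shown (Bucket, MatchString, BuildBuckets, ComputeHash). A self-contained sketch of the same length-bucket idea might look like the following. It swaps the hand-rolled hash comparison for a HashSet<string> per keyword length (which needs .NET 3.5), handles only positive keywords, and uses hypothetical names throughout, so treat it as an approximation of the approach rather than the answerer's code.

```csharp
using System;
using System.Collections.Generic;

public static class LengthBucketSearch
{
    // Group keywords by length, so every window of the input is looked up
    // once per distinct keyword length rather than once per keyword.
    public static Dictionary<int, HashSet<string>> BuildBuckets(IEnumerable<string> keywords)
    {
        Dictionary<int, HashSet<string>> buckets = new Dictionary<int, HashSet<string>>();
        foreach (string keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue; // empty keywords are ignored
            HashSet<string> bucket;
            if (!buckets.TryGetValue(keyword.Length, out bucket))
            {
                bucket = new HashSet<string>(StringComparer.Ordinal);
                buckets.Add(keyword.Length, bucket);
            }
            bucket.Add(keyword);
        }
        return buckets;
    }

    // True if any keyword occurs anywhere in the input (substring semantics).
    public static bool ContainsAny(string input, Dictionary<int, HashSet<string>> buckets)
    {
        for (int i = 0; i < input.Length; i++)
        {
            foreach (KeyValuePair<int, HashSet<string>> bucket in buckets)
            {
                if (i + bucket.Key > input.Length)
                    continue; // window would run past the end of the input
                if (bucket.Value.Contains(input.Substring(i, bucket.Key)))
                    return true;
            }
        }
        return false;
    }
}
```

With keywords of 5-15 characters there are at most 11 distinct lengths, so each input position costs at most 11 substring lookups no matter how many keywords exist. The Substring calls allocate, though, so profiling against plain Contains is still advisable.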

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

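    // Alternative that only reports whether any keyword is present. It replaces the
    // version above: two methods that differ only in return type cannot coexist in one class.
    private static bool FindAny(string searchIn, string[] searchFor)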
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }