C# Filter an IEnumerable<string> for unwanted strings

c# linq optimization

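Edit: I have received some very good suggestions. I will work through them and accept an answer at some point.

I have a large list of strings (800k) and I want to filter out a list of unwanted words (ultimately profanity, but it could be anything) in the shortest time possible.

The result I ultimately want is that a list such as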

Hello,World,My,Name,Is,Yakyb,Shell

would become

World,My,Name,Is,Yakyb

after being checked against

Hell,Heaven.

My code so far is

 var words = items
            .Distinct()
            .AsParallel()
            .Where(x => !WordContains(x, WordsUnwanted));

public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count(); i++)
        {
            if (word.Contains(words[i]))
            {
                return true;
            }
        }
        return false;
    }

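At present it takes about 2.3 seconds (9.5 without parallel) to process 800k words, which as a one-off is no big deal. However, as a learning exercise, is there a quicker way of processing this?
The unwanted words list is 100 words long.

None of the words contain punctuation or spaces.
Steps taken so far:
Removed duplicates from all lists.
Checked whether working with an array is quicker (it isn't). Interestingly, changing the parameter words to a string[] makes it 25% slower.
Added AsParallel(), which reduced the time to about 2.3 seconds.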
Try the method called Except:

var words = new List<string>() {"Hello","Hey","Cat"};
var filter = new List<string>() {"Cat"};
var filtered = words.Except(filter);
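
One thing worth spelling out about Except: it compares whole strings for equality, so on its own it does not remove "Hello" when the filter contains "Hell". The sketch below shows that behaviour with a case-insensitive comparer. The class and method names (ExceptDemo, RemoveExact) are made up for illustration and do not come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptDemo
{
    // Hypothetical helper: removes entries that EQUAL an unwanted word, ignoring case.
    // This is a whole-string comparison, so "Hello" survives a filter of "Hell".
    // Except has set semantics, so duplicates in 'words' are also collapsed.
    public static List<string> RemoveExact(IEnumerable<string> words, IEnumerable<string> unwanted)
    {
        return words.Except(unwanted, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

For example, RemoveExact(new[] { "Hello", "Hey", "cat", "Hell" }, new[] { "Cat", "Hell" }) keeps "Hello" and "Hey". It could serve as a cheap first pass before a substring test, but it does not replace one.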

Also:
var words = new List<string>() {"Hello","Hey","cat"};
var filter = new List<string>() {"Cat"};
// Perhaps an Except() here first, to remove exact matches before testing substrings?
// AsParallel() has to come before Where() for the filter itself to run in parallel.
// You could experiment with AsParallel() and see
// if running the query in parallel yields faster results on a larger string[]
// AsParallel is probably not worth the cost unless the list is large
var filtered = words.AsParallel().Where(i => !ContainsAny(i, filter));

public bool ContainsAny(string str, IEnumerable<string> values)
{
   // Both conditions must hold: a non-empty string and at least one value to test
   if (!string.IsNullOrEmpty(str) && values.Any())
   {
       foreach (string value in values)
       {
             // Case-insensitive comparison suggested by @TimSchmelter
             if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;

             //if(str.ToLowerInvariant().Contains(value.ToLowerInvariant()))
             // return true;
       }
   }

   return false;
}

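Ah, filtering words based on matches from a "bad" list. This is a tricky problem that has tested the wits of many programmers. My mate from Scunthorpe wrote a dissertation on it.
What you really want to avoid is a solution that tests a word in O(lm), where l is the length of the word to test and m is the number of bad words. To do that, you need a solution that does not loop through the bad words. I had thought a regular expression would solve this, but I forgot that typical implementations have an internal data structure that grows with each alternation. As one of the other solutions says, Aho-Corasick is the algorithm that does this. The standard implementation finds all matches; since you can exit on the first match, yours would be more efficient. I think this provides a theoretically optimal solution.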
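
To make the Aho-Corasick suggestion concrete, here is a minimal sketch that builds a trie of the unwanted words, adds failure links, and exits on the first match. It assumes case-insensitive matching via char.ToLowerInvariant. The class name UnwantedWordMatcher and its members are hypothetical and not taken from any of the answers, and the sketch has not been benchmarked against the loop in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating Aho-Corasick with early exit.
public sealed class UnwantedWordMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public bool IsMatch; // an unwanted word ends here, directly or via failure links
    }

    private readonly Node _root = new Node();

    public UnwantedWordMatcher(IEnumerable<string> unwanted)
    {
        // 1. Build a trie of the unwanted words (lower-cased).
        foreach (string word in unwanted)
        {
            if (string.IsNullOrEmpty(word)) continue;
            Node node = _root;
            foreach (char c in word)
            {
                char key = char.ToLowerInvariant(c);
                if (!node.Next.TryGetValue(key, out Node child))
                {
                    child = new Node();
                    node.Next[key] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // 2. Breadth-first pass to set failure links.
        var queue = new Queue<Node>();
        foreach (Node first in _root.Next.Values)
        {
            first.Fail = _root;
            queue.Enqueue(first);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<char, Node> pair in current.Next)
            {
                Node fallback = current.Fail;
                while (fallback != _root && !fallback.Next.ContainsKey(pair.Key))
                    fallback = fallback.Fail;
                pair.Value.Fail = fallback.Next.TryGetValue(pair.Key, out Node target) ? target : _root;
                // Inherit matches that end at the failure target (a suffix of this path).
                pair.Value.IsMatch |= pair.Value.Fail.IsMatch;
                queue.Enqueue(pair.Value);
            }
        }
    }

    // Returns true as soon as any unwanted word occurs inside 'text'.
    public bool ContainsAny(string text)
    {
        if (string.IsNullOrEmpty(text)) return false;
        Node node = _root;
        foreach (char c in text)
        {
            char key = char.ToLowerInvariant(c);
            while (node != _root && !node.Next.ContainsKey(key))
                node = node.Fail;
            if (node.Next.TryGetValue(key, out Node next))
                node = next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

With the question's data, new UnwantedWordMatcher(new[] { "Hell", "Heaven" }) reports a match for "Hello" and "Shell" and none for "World". The matcher is only read after construction, so it could be shared across an AsParallel() query, for example items.Distinct().AsParallel().Where(x => !matcher.ContainsAny(x)).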
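I was keen to see whether I could come up with a faster way of doing this, but I only managed one small optimisation. That was to check the index of a string occurring within another string: firstly it seems to be slightly faster than Contains, and secondly it lets you specify case insensitivity (if that is useful to you).
Below is a test class I wrote. I used more than 1 million words and searched with a case-insensitive test in all cases. It tests your method and also a regular expression that I build dynamically. You can try it yourself and look at the timings. The regular expression does not run as fast as the method you provided, but I may be building it incorrectly. I use (?i) before (word1|word2...) to specify case insensitivity in the regular expression (I would love to know how it could be optimised; it is probably suffering from the classic backtracking problem!).
The search methods (whether the regular expression or the original method provided) seem to slow down progressively as more "unwanted" words are added.
Anyway, I hope this simple test helps you a bit: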
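using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Load your text here. I took War and Peace from Project Gutenberg (http://www.gutenberg.org/ebooks/2600.txt.utf-8) and loaded it twice to give 1.2 million words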
        List<string> loaded = File.ReadAllText(@"D:\Temp\2600.txt").Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();

        List<string> items = new List<string>();
        items.AddRange(loaded);
        items.AddRange(loaded);

        Console.WriteLine("Loaded {0} words", items.Count);

        Stopwatch sw = new Stopwatch();

        List<string> WordsUnwanted = new List<string> { "Hell", "Heaven", "and", "or", "big", "the", "when", "ur", "cat" };
        // Build a case-insensitive alternation such as (?i)(Hell|Heaven|...); Regex.Escape guards against metacharacters in the words
        string regularExpression = "(?i)(" + string.Join("|", WordsUnwanted.Select(w => Regex.Escape(w))) + ")";
        Console.WriteLine(regularExpression);

        List<string> words = null;

        bool loop = true;

        while (loop)
        {
            Console.WriteLine("Enter test type - 1, 2, 3, 4 or Q to quit");
            ConsoleKeyInfo testType = Console.ReadKey();

            switch (testType.Key)
            {
                case ConsoleKey.D1:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();

                    sw.Stop();
                    Console.WriteLine("Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D2:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D3:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Compiled regex (parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D4:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Compiled regex (non-parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.Q:
                    loop = false;
                    break;

                default:
                    continue;
            }
        }
    }

    public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count; i++)
        {
            // Found that IndexOf was a bit faster than Contains, and it also lets you control case sensitivity
            //if (word.Contains(words[i]))
            if (word.IndexOf(words[i], StringComparison.InvariantCultureIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
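
On the question of optimising the regular expression, one option is to construct a single Regex instance up front with RegexOptions.Compiled and reuse it, rather than passing the pattern string to the static Regex.IsMatch on every call. The sketch below shows that under stated assumptions: the class and method names (UnwantedRegex, Build) are hypothetical, and whether this beats the IndexOf loop would have to be measured with the test harness above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnwantedRegex
{
    // Hypothetical helper: builds one reusable, compiled, case-insensitive regex
    // that matches if any unwanted word occurs anywhere in the input.
    // Note: an empty word list yields an empty pattern, which matches every input.
    public static Regex Build(IEnumerable<string> unwanted)
    {
        string pattern = string.Join("|", unwanted.Select(w => Regex.Escape(w)));
        return new Regex(pattern,
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

In the test class this would be used as Regex unwanted = UnwantedRegex.Build(WordsUnwanted); followed by items.Distinct().Where(x => !unwanted.IsMatch(x)).ToList(). A Regex instance is safe to share across threads for matching, so the same instance should also work in the AsParallel() variants.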

var words = new HashSet<string>(items) //this uses HashCodes
        .AsParallel()...

var filter = new HashSet<string>(
    new[] {"hello", "of", "this", "and", "for", "is", 
        "bye", "the", "see", "in", "an", 
        "top", "v", "t", "e", "a" }); 

var list = new HashSet<string> (items)
            .AsParallel()
            .Where(x => !filter.Contains(new PorterStemmer().Stem(x)))
            .ToList();