C#: Filter an IEnumerable<string> for unwanted strings


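Edit: I've received some very good suggestions; I'll work through them and accept an answer at some point.

I have a large list of strings (800k) and I'd like to filter out a list of unwanted words (ultimately profanity, but it could be anything) in the shortest possible time.

The result I ultimately want is that a list such as

Hello,World,My,Name,Is,Yakyb,Shell

becomes

World,My,Name,Is,Yakyb

after being checked against

Hell,Heaven.

My code so far is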

    var words = items
        .Distinct()
        .AsParallel()
        .Where(x => !WordContains(x, WordsUnwanted));

    // Returns true if 'word' contains any entry of 'words' as a substring
    public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count; i++) // Count property, not the LINQ Count() extension
        {
            if (word.Contains(words[i]))
            {
                return true;
            }
        }
        return false;
    }
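Currently, processing the 800k words takes about 2.3 seconds (9.5 seconds without the parallel step), which is no big deal for a one-off run. However, as a learning exercise: is there a faster way to do this?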

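The list of unwanted words is 100 words long.
None of these words contain punctuation or spaces.

Steps taken so far:

  • Removed duplicates from all lists.
  • Checked whether using an array is faster (it isn't). Interestingly, changing the words parameter to string[] makes it 25% slower.
  • Added AsParallel(), which cut the time to about 2.3 seconds.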

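  • Try the method called Except:

    var words = new List<string>() {"Hello","Hey","cat"};
    var filter = new List<string>() {"Cat"};
    // Except compares whole strings (case-sensitively by default), not substrings
    var filtered = words.Except(filter);

    Also: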

    var words = new List<string>() {"Hello","Hey","cat"};
    var filter = new List<string>() {"Cat"};
    // Perhaps an Except() here first, to drop exact matches before the substring test?
    // AsParallel() has to come before Where() for the filter itself to run in parallel.
    // You could experiment with and without it; it is probably not worth the
    // overhead unless the list is large.
    var filtered = words.AsParallel().Where(i => !ContainsAny(i, filter));

    public static bool ContainsAny(string str, IEnumerable<string> values)
    {
       // Both conditions must hold: a non-empty string and at least one value to look for
       if (!string.IsNullOrEmpty(str) && values.Any())
       {
           foreach (string value in values)
           {
                 // Ignore case comparison from @TimSchmelter
                 if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;
    
                 //if(str.ToLowerInvariant().Contains(value.ToLowerInvariant()))
                 // return true;
           }
       }
    
       return false;
    }
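
The Except() idea mentioned in the comment above could be sketched roughly as follows. This is only an illustration: the class and method names are made up, and passing StringComparer.OrdinalIgnoreCase assumes that exact matches should ignore case. Since the substring test already covers exact matches, the Except() step is purely a cheap pre-pass; note that, being a set operation, it also removes duplicates from the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptThenContains
{
    // Hypothetical helper: drops exact matches first, then substring matches.
    public static List<string> Filter(IEnumerable<string> words, IReadOnlyCollection<string> unwanted)
    {
        return words
            // Set difference: removes exact (case-insensitive) matches and duplicates.
            .Except(unwanted, StringComparer.OrdinalIgnoreCase)
            // Substring test for whatever survived the exact-match pass.
            .Where(w => !unwanted.Any(u => w.IndexOf(u, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

Whether the extra pass pays off depends on how many input words equal an unwanted word outright, so it is worth timing both variants.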
    
    
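    Ah, filtering words based on matches against a "bad" list. This is a tricky problem that has tested many a programmer. My mate from Scunthorpe wrote a dissertation on it.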


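    What you really want to avoid is a solution that tests each word in O(lm), where l is the length of the word being tested and m is the number of bad words. To do that, you need a solution that does not loop over the bad words. I had thought a regular expression would solve this, but I forgot that typical implementations have an internal data structure that grows with every alternation. As another answer says, Aho-Corasick is the algorithm for this. The standard implementation finds all matches; because you can exit on the first match, your implementation can be more efficient. I think this gives a theoretically optimal solution.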
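
A minimal sketch of that idea is shown below. It is not taken from any answer in this thread: the class name AnyWordMatcher is made up, and folding case with char.ToLowerInvariant is a simplifying assumption. It builds a trie of the unwanted words, adds failure links in breadth-first order, and returns as soon as any unwanted word is found, so each word is scanned once regardless of how many unwanted words there are.

```csharp
using System;
using System.Collections.Generic;

// Minimal Aho-Corasick automaton that only answers
// "does the text contain any of the patterns?".
public sealed class AnyWordMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public bool IsMatch;
    }

    private readonly Node _root = new Node();

    public AnyWordMatcher(IEnumerable<string> patterns)
    {
        // 1. Build a trie of the lower-cased patterns.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = _root;
            foreach (char raw in pattern)
            {
                char c = char.ToLowerInvariant(raw);
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // 2. Breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != _root && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                Node target;
                if (!fail.Next.TryGetValue(edge.Key, out target)) target = _root;
                edge.Value.Fail = target;
                // A node also counts as a match if a suffix of its path is a pattern.
                edge.Value.IsMatch |= target.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    public bool ContainsAny(string text)
    {
        Node node = _root;
        foreach (char raw in text)
        {
            char c = char.ToLowerInvariant(raw);
            // Follow failure links until a node with an edge for c is found (or the root).
            while (node != _root && !node.Next.ContainsKey(c)) node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next)) node = next;
            if (node.IsMatch) return true; // early exit on the first hit
        }
        return false;
    }
}
```

Assuming the names from the question, it could be plugged in as items.Distinct().AsParallel().Where(x => !matcher.ContainsAny(x)) after creating var matcher = new AnyWordMatcher(WordsUnwanted). The automaton is only read after construction, so sharing one instance across the parallel query should be fine.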

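    I was curious to see whether I could come up with a faster way of doing this, but I only managed one small optimization: checking the index of one string within another with IndexOf. Firstly, it seems to be slightly faster than Contains; secondly, it lets you specify case insensitivity (if that is useful to you).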

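    Below is a test class I wrote. I used over 1 million words and searched with a case-insensitive test in all cases. It tests your method and also a regular expression that I build on the fly. You can try it yourself and look at the timings; the regex does not run as fast as the method you provided, but I may be building it incorrectly. I use (?i) before (word1|word2...) to specify case insensitivity in the regex (I'd love to know how this could be optimized; it may be suffering from the classic backtracking problem!).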

    Both search methods (the regex and the original method provided) seem to get progressively slower as more "unwanted" words are added.

    Anyway, I hope this simple test helps you a little:

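    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // Load your text here - I got War and Peace from Project Gutenberg (http://www.gutenberg.org/ebooks/2600.txt.utf-8) and loaded it twice to give 1.2 million words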
            List<string> loaded = File.ReadAllText(@"D:\Temp\2600.txt").Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
    
            List<string> items = new List<string>();
            items.AddRange(loaded);
            items.AddRange(loaded);
    
            Console.WriteLine("Loaded {0} words", items.Count);
    
            Stopwatch sw = new Stopwatch();
    
            List<string> WordsUnwanted = new List<string> { "Hell", "Heaven", "and", "or", "big", "the", "when", "ur", "cat" };
            // Escape each word so regex metacharacters are treated literally, then join them as alternatives
            string regularExpression = "(?i)(" + string.Join("|", WordsUnwanted.Select(w => Regex.Escape(w))) + ")";
            Console.WriteLine(regularExpression);
    
            List<string> words = null;
    
            bool loop = true;
    
            while (loop)
            {
                Console.WriteLine("Enter test type - 1, 2, 3, 4 or Q to quit");
                ConsoleKeyInfo testType = Console.ReadKey();
    
                switch (testType.Key)
                {
                    case ConsoleKey.D1:
                        sw.Reset();
                        sw.Start();
                        words = items
                            .Distinct()
                            .AsParallel()
                            .Where(x => !WordContains(x, WordsUnwanted)).ToList();
    
                        sw.Stop();
                        Console.WriteLine("Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                        words = null;
                        break;
    
                    case ConsoleKey.D2:
                        sw.Reset();
                        sw.Start();
                        words = items
                            .Distinct()
                            .Where(x => !WordContains(x, WordsUnwanted)).ToList();
    
                        sw.Stop();
                        Console.WriteLine("Non-Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                        words = null;
                        break;
    
                    case ConsoleKey.D3:
                        sw.Reset();
                        sw.Start();
                        words = items
                            .Distinct()
                            .AsParallel()
                            .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
    
                        sw.Stop();
                        Console.WriteLine("Non-Compiled regex (parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                        words = null;
                        break;
    
                    case ConsoleKey.D4:
                        sw.Reset();
                        sw.Start();
                        words = items
                            .Distinct()
                            .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
    
                        sw.Stop();
                        Console.WriteLine("Non-Compiled regex (non-parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                        words = null;
                        break;
    
                    case ConsoleKey.Q:
                        loop = false;
                        break;
    
                    default:
                        continue;
                }
            }
        }
    
        public static bool WordContains(string word, List<string> words)
        {
            for (int i = 0; i < words.Count; i++)
            {
                // Found that IndexOf was a bit faster than Contains, and it also lets you choose the comparison (here: ignore case)
                //if (word.Contains(words[i]))
                if (word.IndexOf(words[i], StringComparison.InvariantCultureIgnoreCase) >= 0)
                    return true;
            }
            return false;
        }
    }
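
One thing that may be worth trying for the regex cases in the test class above is building a single Regex instance up front instead of passing the pattern string to the static Regex.IsMatch on every call. The sketch below assumes a non-empty list of unwanted words (an empty pattern would match everything); the UnwantedRegex class name is made up for illustration. Whether RegexOptions.Compiled actually helps for this workload is something to measure, not a given.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnwantedRegex
{
    // Builds one reusable Regex; Regex.Escape keeps any metacharacters in the words literal.
    // Assumes 'unwanted' contains at least one word.
    public static Regex Build(IEnumerable<string> unwanted)
    {
        string pattern = string.Join("|", unwanted.Select(w => Regex.Escape(w)));
        return new Regex(pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
    }
}
```

In the test class this would replace the lambda body with something like x => !unwantedRegex.IsMatch(x), where unwantedRegex is created once before the stopwatch starts.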
    
    
    var words = new HashSet<string>(items) //this uses HashCodes
            .AsParallel()...
    
    var filter = new HashSet<string>(
        new[] {"hello", "of", "this", "and", "for", "is", 
            "bye", "the", "see", "in", "an", 
            "top", "v", "t", "e", "a" }); 
    
    var list = new HashSet<string> (items)
                .AsParallel()
                .Where(x => !filter.Contains(new PorterStemmer().Stem(x)))
                .ToList();
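
Note that a HashSet lookup tests whole-string equality, not substring containment, so this approach behaves differently from the Contains-based filters earlier in the thread: with "Hell" in the filter, "Shell" would be kept. A small sketch of just the lookup part, leaving out the stemmer (the PorterStemmer class is not shown in this thread) and assuming a case-insensitive comparer is wanted; the ExactWordFilter name is made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExactWordFilter
{
    // Keeps every distinct item that is not exactly equal (ignoring case) to an unwanted word.
    public static List<string> Filter(IEnumerable<string> items, IEnumerable<string> unwanted)
    {
        var filter = new HashSet<string>(unwanted, StringComparer.OrdinalIgnoreCase);
        return items
            .Distinct()
            .Where(x => !filter.Contains(x)) // hash lookup per word, independent of filter size
            .ToList();
    }
}
```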