VB.NET: Filtering a large generic list based on another list


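I have two generic lists of type String. The first contains about 1,000,000 terms and the second about 100,000 keywords. A term in the first list may or may not contain a keyword from the second list. I need to isolate the terms in the first list that do not contain any keyword from the second list. Currently I am doing it like this (VB.NET and Framework 3.5):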


Needless to say, this takes forever. What is the fastest way to achieve this? Perhaps using a Dictionary? Any hints would help.

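A simple dictionary of keywords won't work here, because you are doing a containment check rather than a simple equality check. One approach you could take is to combine the search terms into a tree. How much the tree helps depends on how much overlap there is among the search terms. I put together a basic tree implementation (without much testing) as a starting point: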

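Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Class WordSearchTree

    ' Child nodes keyed by the next character of a keyword.
    Private ReadOnly _branches As New Dictionary(Of Char, WordSearchTree)
    ' True when a keyword ends at this node (needed when one keyword is a prefix of another).
    Private _isKeywordEnd As Boolean

    ' Returns True if any keyword in the tree occurs somewhere inside word.
    Public Function WordContainsTerm(ByVal word As String) As Boolean
        ' Range takes (start, count), so word.Length tries every start index.
        Return Not String.IsNullOrEmpty(word) AndAlso _
               Enumerable.Range(0, word.Length) _
                         .Any(Function(i) WordContainsInternal(word, i))
    End Function

    ' Follows word through the tree from charIndex; True as soon as a keyword end is reached.
    Private Function WordContainsInternal(ByVal word As String, ByVal charIndex As Integer) As Boolean
        If _isKeywordEnd Then Return True
        If charIndex >= word.Length Then Return False
        Dim child As WordSearchTree = Nothing
        Return _branches.TryGetValue(word(charIndex), child) AndAlso _
               child.WordContainsInternal(word, charIndex + 1)
    End Function

    Public Shared Function BuildTree(ByVal words As IEnumerable(Of String)) As WordSearchTree
        If words Is Nothing Then Throw New ArgumentNullException("words")
        Dim ret As New WordSearchTree()
        For Each w In words
            ' Skip null or empty keywords; an empty keyword would match every term.
            If String.IsNullOrEmpty(w) Then Continue For
            Dim curTree As WordSearchTree = ret
            For Each c In w
                If Not curTree._branches.ContainsKey(c) Then
                    curTree._branches.Add(c, New WordSearchTree())
                End If
                curTree = curTree._branches(c)
            Next
            curTree._isKeywordEnd = True
        Next
        Return ret
    End Function

End Class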
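
To make the equality-versus-containment point from the answer concrete, here is a minimal sketch with made-up sample strings (they are not from the question). A hash-based lookup such as HashSet or Dictionary only answers whether a whole string equals a stored keyword, so it misses a keyword embedded in a longer term:

```vbnet
Imports System
Imports System.Collections.Generic

Module EqualityVersusContainment
    Sub Main()
        ' Hypothetical sample keyword, for illustration only.
        Dim keywords As New HashSet(Of String)(New String() {"apple"})

        ' Hash lookup: matches only when the whole string equals a keyword.
        Console.WriteLine(keywords.Contains("apple"))       ' True
        Console.WriteLine(keywords.Contains("red apple"))   ' False

        ' Substring test: finds the keyword inside the longer term.
        Console.WriteLine("red apple".Contains("apple"))    ' True
    End Sub
End Module
```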

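First, if you are only checking whether a string contains a substring, use the String.Contains method rather than String.IndexOf.

I just checked the Dictionary class. Although I could easily create a key/value pair from each term, the problem is that I would end up with duplicate keys, which is no good.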
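
As a baseline for the String.Contains suggestion, the filter can be written as a nested check. This is only a sketch with made-up sample data: it still tests every term against every keyword, so it does not solve the running-time problem by itself. What it does show is the comparison being used: String.Contains performs an ordinal, case-sensitive test, whereas the single-argument String.IndexOf overload is culture-sensitive.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module ContainsBaseline
    Sub Main()
        ' Hypothetical sample data standing in for the real lists.
        Dim terms As New List(Of String)(New String() {"red apple", "green pear", "blue plum"})
        Dim keywords As New List(Of String)(New String() {"apple", "plum"})

        ' Keep only the terms that contain none of the keywords.
        Dim kept As List(Of String) = _
            terms.Where(Function(t) Not keywords.Any(Function(k) t.Contains(k))).ToList()

        For Each t As String In kept
            Console.WriteLine(t)   ' prints "green pear"
        Next
    End Sub
End Module
```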
Usage, where keywordList and termList are the two List(Of String) instances from the question:

Dim keys As WordSearchTree = WordSearchTree.BuildTree(keywordList)
termList.RemoveAll(AddressOf keys.WordContainsTerm)
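
Note that List(Of T).RemoveAll modifies the list in place and returns the number of removed items. If the original term list has to stay intact, FindAll with the opposite predicate builds a new list instead. A small sketch of the difference, using hypothetical stand-in predicates (ContainsX and LacksX) in place of the tree lookup:

```vbnet
Imports System
Imports System.Collections.Generic

Module RemoveAllVersusFindAll
    ' Hypothetical stand-in predicate; real code would pass keys.WordContainsTerm.
    Function ContainsX(ByVal term As String) As Boolean
        Return term.Contains("x")
    End Function

    ' Opposite predicate, used to select the terms to keep.
    Function LacksX(ByVal term As String) As Boolean
        Return Not ContainsX(term)
    End Function

    Sub Main()
        Dim terms As New List(Of String)(New String() {"box", "cat", "fox"})

        ' FindAll leaves terms untouched and returns a new list.
        Dim kept As List(Of String) = terms.FindAll(AddressOf LacksX)
        Console.WriteLine(kept.Count)      ' 1
        Console.WriteLine(terms.Count)     ' 3

        ' RemoveAll mutates terms and returns how many items were removed.
        Dim removed As Integer = terms.RemoveAll(AddressOf ContainsX)
        Console.WriteLine(removed)         ' 2
        Console.WriteLine(terms.Count)     ' 1
    End Sub
End Module
```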