Excel 在大型数据集上执行匹配/索引_Excel_Vba_Indexing_Match_Vlookup

Excel 在大型数据集上执行匹配/索引

excel vba indexing

Excel 在大型数据集上执行匹配/索引,excel,vba,indexing,match,vlookup,Excel,Vba,Indexing,Match,Vlookup,我有一个55.000行的数据集，其中35.000个电子邮件地址是唯一的，其中31.000个是唯一的，因此一些用户占用了多行。我需要找到这些用户的行并将它们添加到类对象中将电子邮件列加载到数组中并执行匹配/索引查找需要200秒。目前它是可以接受的，但对于200-500K数据集的预期用途来说，速度肯定不够快 Dim StartTime As Double Dim SecondsElapsed As Double StartTime = Timer Dim dict As Dictionary S

我有一个55.000行的数据集，其中35.000个电子邮件地址是唯一的，其中31.000个是唯一的，因此一些用户占用了多行。我需要找到这些用户的行并将它们添加到类对象中

将电子邮件列加载到数组中并执行匹配/索引查找需要200秒。目前它是可以接受的，但对于200-500K数据集的预期用途来说，速度肯定不够快

Dim StartTime As Double
Dim SecondsElapsed As Double
StartTime = Timer

Dim dict As Dictionary
Set dict = CreateObject("scripting.dictionary")

LastRow = Cells(Rows.Count, 1).End(xlUp).Row

Set UserRange = Range(Cells(2, 11), Cells(LastRow, 11))
For Each cell In UserRange
    dict(cell.value) = dict(cell.value) + 1
Next

Debug.Print "Number of users: " & dict.Count

UserArray = Range(Cells(2, 11), Cells(LastRow, 11))
UserArray = WorksheetFunction.Transpose(WorksheetFunction.Transpose(WorksheetFunction.Transpose(UserArray)))

For Each User In dict
    Dim UserIndex() As Variant
    ReDim UserIndex(1 To dict(User))
    For i = 1 To dict(User)
        Row = WorksheetFunction.Match(User, UserArray, 0)
        UserIndex(i) = Row
        UserArray(Row) = Empty
    Next
    For i = LBound(UserIndex) To UBound(UserIndex)
        Debug.Print User, UserIndex(i)
    Next
Next

SecondsElapsed = Round(Timer - StartTime, 2)
Debug.Print "This code ran successfully in " & SecondsElapsed & " seconds"

我可能会基于源代码构建一个基于块的索引（每个导入的文件+-10.000个条目都有一个开始行和结束行），并通过只查找适当的块来加快索引速度。但也许还有另一种方法？

这里有一种速度相当快的不同方法：

Sub Lister()

    Dim t, i, m, arr, rng, dict As Object, dictDupes As Object, usr, v
    Set dict = CreateObject("scripting.dictionary")
    Set dictDupes = CreateObject("scripting.dictionary")

    Set rng = Range("A1:A500000")
    'create some dummy data (0.5M rows)
    With rng
        .Formula = "=""USER_"" & ROUND(RAND()*5000,0) & ""_"" & ROUND(RAND()*3000,0)"
        .Value = .Value
    End With

    t = Timer

    arr = rng.Value
    For i = 1 To UBound(arr, 1)
        usr = arr(i, 1)
        If Not dict.exists(usr) Then
            dict.Add usr, i
        Else
            If Not dictDupes.exists(usr) Then dictDupes.Add usr, dict(usr)
            dictDupes(usr) = dictDupes(usr) & "|" & i
        End If
    Next i

    For Each usr In dictDupes
        v = dictDupes(usr)
        'Debug.Print "----" & usr & "---"
        'Debug.Print Join(Split(v, "|"), ", ")
    Next usr

    Debug.Print dict.Count, dictDupes.Count
    Debug.Print "Done in", Timer - t

End Sub

大约在20-25秒内完成

另一项说明：

如果您想使用Match，那么将数据保留在工作表上比对数组运行Match要快得多

Sub TestMatch()

    Dim t, i, m, arr, rng

    Set rng = Range("A1:A50000")
    With rng
        .Formula = "=ROUND(RAND()*30000,0)"
        .Value = .Value
    End With

    t = Timer
    For i = 1 To 10000
        m = Application.Match(i, rng, 0)
    Next i
    Debug.Print "sheet", Timer - t

    arr = rng.Value
    t = Timer
    For i = 1 To 10000
        arr = rng.Value
        m = Application.Match(i, arr, 0)
    Next i
    Debug.Print "array", Timer - t

End Sub

输出：

sheet          3.644531 
array          131.9453

因此，阵列速度大约慢了35倍

多亏了Tim，我找到了一个解决方案：

Dim dict As Dictionary
Set dict = CreateObject("scripting.dictionary")

LastRow = Cells(Rows.Count, 1).End(xlUp).Row

Set UserRange = Range(Cells(2, 11), Cells(LastRow, 11))
For Each cell In UserRange
    dict(cell.value) = dict(cell.value) + 1
Next

Debug.Print "Number of users: " & dict.Count

t = Timer

For Each User In dict
    Set Profile = New UserProfile
    Profile.Count = dict(User)
    Dim UserIndex() As Variant
    ReDim UserIndex(1 To dict(User))
    For i = 1 To dict(User)
        Row = WorksheetFunction.Match(User, UserRange, 0)
        UserIndex(i) = Row
    Next
    For i = LBound(UserIndex) To UBound(UserIndex)
        Dim Purchase() As Variant
        ReDim Purchase(1 To LastCol) As Variant
        Purchase = Range(Cells(UserIndex(i) + 1, 1), Cells(UserIndex(i) + 1, LastCol))
        Profile.Add Purchase
    Next
Next

Debug.Print "Match/Index loop completed in ", Timer - t

结果表明，在范围而不是数组上进行匹配要快得多。从一个范围中读取，而不是对数组执行工作表函数.Index也是如此。这些结果对我来说都是出乎意料的，因为我认为读/写工作簿通常会减慢速度。我还在UserProfile类中的Purchase数组读数中添加了一个（1，），以避免转置

整个55K数据集的评测只需23秒就完成了！

您是否承诺使用Excel？如果您将数据转储到Access中，您可以轻松/快速地运行一些SQL以返回重复数据。Excel VBA是我目前所拥有的全部。这是一段长而复杂的代码中的最后一步。工作表不是为此制作的，您需要的是实际的数据库e、另外，

Application.Transpose

也有局限性：此解决方案无法扩展。您可以做的一件事是在第一个循环中，迭代

范围（…）.Value

使用

For

循环，而不是使用

范围

并拉动

范围。Value

对于源范围中的每个单元格两次：您将对整个源数据访问一次工作表，而不是对

用户范围

中的每个单元格访问一次。请注意，您有许多隐式ActiveSheet引用在那里，以后可能会导致问题。如果不使用debug.print，而是在末尾将整个内容写入文本文件或类似的内容，则可以节省一点时间。将20k+转储到即时窗口似乎是不必要的。循环中的redim可能也没有帮助。字典只需0.42秒即可构建，因此访问工作簿不是这里的瓶颈。调试。打印我已经关闭了整个工作（它只在我不循环时才在那里）。这真的是循环和匹配/索引需要花费时间。有趣的发现Tim。我将我的用户放在UserRange中，并用UserArray在23秒和200秒的时间内完成匹配。现在看来真正的瓶颈是我循环中的第二个循环。需要做更多测试。这是工作表。索引需要花费很长时间。我将尝试找出一个可行的方法另外，我认为Match/Index经过优化，可以针对工作表上的数据执行操作（因为它们通常在工作表上使用）。针对数组使用它们很方便，但这样做时相对性能很差。不过，通常情况下，您不会对这么多数据进行操作，所以不明显。