Excel 收集并小计变量二维数组中的重复行

Excel 收集并小计变量二维数组中的重复行,excel,multidimensional-array,vba,Excel,Multidimensional Array,Vba,我在Excel中有一个数据集,由20列和不同数量的行组成,范围从20000到50000 每行是一个项目集合,其中一列表示集合中项目的数量,另一列表示集合的总组合重量。 其中有些行在所有列中完全相同,而有些行除了数量和重量之外在所有列中都相同 我想创建一个宏,它在数据集中运行,并“堆叠”除数量和重量之外的所有其他参数上重复的行,然后将这两个参数相加 换言之,用于转换以下内容的宏: |Param1|Param2|...|Param18|Quantity|Weight| | A | 1 |

我在Excel中有一个数据集,由20列和不同数量的行组成,范围从20000到50000

每行是一个项目集合,其中一列表示集合中项目的数量,另一列表示集合的总组合重量。 其中有些行在所有列中完全相同,而有些行除了数量和重量之外在所有列中都相同

我想创建一个宏,它在数据集中运行,并“堆叠”除数量和重量之外的所有其他参数上重复的行,然后将这两个参数相加

换言之,用于转换以下内容的宏:

|Param1|Param2|...|Param18|Quantity|Weight|
| A    | 1    |...| C     | 5      | 12.5 |
| A    | 1    |...| C     | 2      |  5.0 |
| A    | 1    |...| C     | 3      |  7.5 |
| B    | 2    |...| C     | 1      |  2.3 |
| B    | 2    |...| C     | 2      |  4.6 |
为此:

|Param1|Param2|...|Param18|Quantity|Weight|
| A    | 1    |...| C     | 10     | 25.0 |
| B    | 2    |...| C     | 3      |  6.9 |
我知道这在一个简单的透视表中是可以做到的,但由于许多原因,在这种情况下这是不可行的

因为我处理的是一个大数据集,所以我希望一次将其全部加载到内存中,而不是逐行读写以提高性能(如本文第13条提示所述)。然而,我被困在如何对存储在内存中的数据进行行操作

到目前为止,我的代码如下所示:

Dim r, c, LastRow As Integer
Dim temp_range As Variant

LastRow = Cells(65536, 2).End(xlUp).Row

'Load the data set into memory
temp_range = Sheets("1.1").Range(Sheets("1.1").Cells(2, 1), Sheets("1.1").Cells(LastRow, 20)).Value

'Run through the data set from bottom to top and bulk identical rows together
For r = UBound(temp_range) To LBound(temp_range)
    For i = r - 1 To LBound(temp_range)

        'PSEUDO CODE START HERE
        If row temp_range(r) = row temp_range(i) Then
            temp_range(i,19) = temp_range(r,19) + temp_range(i,19)
            temp_range(i,20) = temp_range(r,19) + temp_range(i,20)
            Delete row temp_range(r)
            Exit For
        End if

        'PSEUDO CODE END HERE

    Next i
Next r

我被困在代码中突出的伪代码部分。我只是不知道如何比较行,将数量和重量从一行复制到另一行,然后删除内存中包含范围的变量中的重复行。

使用Excel中的Microsoft Query(SQL):

这样的数据操作非常适合SQL查询。无需使用VBA逐行查看数据:

SELECT S1.Param1, S1.Param2, S1.Param18, SUM(S1.Quantity), SUM(S1.Weight) 
FROM [Sheet1$] AS S1 GROUP BY Param1, Param2,Param18
这保证通过OLE DB快速高效地运行。无论是什么VBA代码,效率都要低得多

要随时刷新查询,只需从VBA运行以下代码:

Set ws = ActiveSheet
ws.QueryTables(1).Refresh BackgroundQuery:=False
其中,ws是查找查询表的工作表


使用数据->其他来源->来自Microsoft Query的Microsoft Query,或者可以使用我的加载项:

查看您是否可以使用F8。您需要进入VBE的工具► 引用并添加Microsoft脚本运行时以使用Scripting.Dictionary对象

Sub Stack_Dupes()
    Dim r As Long, c As Long, v As Long, k As Long
    Dim vKEYs As Variant, vITMs As Variant, vTMP1 As Variant, vTMP2 As Variant
    Dim sKEY As String, sITM As String
    Dim dITMs As New Scripting.dictionary
    
    With ActiveSheet
        With .Cells(1, 1).CurrentRegion
            .Rows(1).Cells.Copy _
              Destination:=.Cells(.Rows.Count + 3, 1)
            With .Offset(1, 0).Resize(.Rows.Count - 1, .Columns.Count)
                vKEYs = .Cells.Resize(.Rows.Count, .Columns.Count - 2).Value2
                vITMs = .Cells.Offset(0, .Columns.Count - 2).Resize(.Rows.Count, 2).Value2
                For r = LBound(vKEYs, 1) To UBound(vKEYs, 1)
                    sKEY = Join(Application.Index(vKEYs, r, 0), ChrW(8203))
                    sITM = Join(Application.Index(vITMs, r, 0), ChrW(8203))
                    If dITMs.Exists(sKEY) Then
                        vTMP1 = Split(dITMs.Item(sKEY), ChrW(8203))
                        vTMP2 = Split(sITM, ChrW(8203))
                        vTMP1(0) = CDbl(vTMP1(0)) + CDbl(vTMP2(0))
                        vTMP1(1) = CDbl(vTMP1(1)) + CDbl(vTMP2(1))
                        sITM = Join(vTMP1, ChrW(8203))
                        dITMs.Item(sKEY) = sITM
                    Else
                        dITMs.Add Key:=sKEY, Item:=sITM
                    End If
                Next r
                ReDim vKEYs(1 To (.Columns.Count - 2), 1 To dITMs.Count)
                ReDim vITMs(1 To 2, 1 To dITMs.Count)
                For Each vTMP1 In dITMs.Keys
                    k = k + 1
                    vTMP2 = Split(vTMP1, ChrW(8203))
                    For v = LBound(vTMP2) To UBound(vTMP2)
                        vKEYs(v + 1, k) = vTMP2(v)
                    Next v
                    vTMP2 = Split(dITMs.Item(vTMP1), ChrW(8203))
                    For v = LBound(vTMP2) To UBound(vTMP2)
                        vITMs(v + 1, k) = vTMP2(v)
                    Next v
                Next vTMP1
                .Cells.Offset(.Rows.Count + 3, 0).Resize(UBound(vKEYs, 2), .Columns.Count - 2) = Application.Transpose(vKEYs)
                .Cells.Offset(.Rows.Count + 3, .Columns.Count - 2).Resize(UBound(vITMs, 2), 2) = Application.Transpose(vITMs)
            End With
        End With
    End With
    dITMs.RemoveAll: Set dITMs = Nothing
End Sub
结果写在样本数据下面,如下所示:

如果您有时间将较大数据集上的计时结果发布到这里,我会对这些结果感兴趣。

我找到了这篇关于如何删除我构建了一些代码的相同行的文章(,请转到页面中间)。它并没有百分之百地解决我原来的问题,因为它没有比较数组中的整行,而是比较每行中的每一列,但结果表明它仍然具有相当好的性能

代码如下:

'I couldn't get the final step of pasting the output array into the new sheet working properly without declaring this option, otherwise the data would be pasted one cell to the right and below where I wanted it.
Option Base 1

Dim r, i, c, LastRow, DeletedRows As Integer
Dim input_array, output_array As Variant
Dim identical As Boolean
Dim s As Worksheet
Dim NewRange As Range

LastRow = Cells(65536, 2).End(xlUp).Row

'Load the data set into memory, consisting of 20 columns of data and a 21th column with no data that is used for marking rows for deletion.
input_array = Sheets("1.1").Range(Sheets("1.1").Cells(2, 1), Sheets("1.1").Cells(LastRow, 21)).Value

DeletedRows = 0

'Run through the data set from bottom to top comparing rows one at a time, copy Quantity and Weight values and mark rows for deletion
For r = UBound(input_array) To 2 Step -1
    For i = r - 1 To 2 Step -1

        'Assume row r and i are identical
        identical = True

        'Run through columns of r and i, flag if non-identical value is found and stop the loop (col 18 is ignored, since this is the "Quantity" column, while col 20 is the "Weight" column)
        For c = 1 To 18
            If input_array(r, c) <> input_array(i, c) And c <> 18 Then
                identical = False
                Exit For
            End If
        Next c

        ' If no non-identical columns were found, add "Quantity" and "Weight" from row r to row i and mark row r for deletion
        If identical Then
            input_array(i, 18) = input_array(i, 18) + input_array(r, 18)
            input_array(i, 20) = input_array(i, 20) + input_array(r, 20)
            input_array(r, 21) = "_DELETE_"
            DeletedRows = DeletedRows + 1
            Exit For
        End If

    Next i
Next r

' Resize the new array to the size of the old array minus the number of deleted rows
ReDim output_array(UBound(input_array) - DeletedRows, 20)

' Copy rows not marked as deleted from old array to new array
i = 1
For r = 1 To UBound(input_array)
    If input_array(r, 21) <> "_DELETE_" Then
        For c = 1 To 20
            output_array(i, c) = input_array(r, c)
        Next c
        i = i + 1
    End If
Next r

' Create new sheet and 
Set s = Sheets.Add
Set NewRange = s.Range("A2").Resize(UBound(output_array), 20)
NewRange = output_array
“如果不声明此选项,将输出数组粘贴到新工作表中的最后一步将无法正常工作,否则数据将被粘贴到我想要的右下方的一个单元格中。
选项基数1
Dim r、i、c、LastRow、DeletedRows作为整数
Dim输入_数组,输出_数组作为变量
与布尔值相同
将s设置为工作表
将新范围变暗为范围
LastRow=单元格(65536,2).结束(xlUp).行
'将数据集加载到内存中,由20列数据和第21列组成,第21列没有用于标记要删除的行的数据。
输入数组=表格(“1.1”)。范围(表格(“1.1”)。单元格(2,1),表格(“1.1”)。单元格(最后一行,21))。值
DeletedRows=0
'从下到上遍历数据集,一次比较一行,复制数量和重量值,并标记要删除的行
对于r=UBound(输入_数组)到2步骤-1
对于i=r-1至2,步骤-1
'假设r行和i行相同
相同=正确
'遍历r和i列,如果发现不相同的值,则标记并停止循环(忽略第18列,因为这是“数量”列,而第20列是“重量”列)
对于c=1到18
如果输入_数组(r,c)输入_数组(i,c)和c18,那么
相同=错误
退出
如果结束
下一个c
'如果未找到不相同的列,请将第r行中的“数量”和“重量”添加到第i行,并将第r行标记为删除
如果相同,那么
输入数组(i,18)=输入数组(i,18)+输入数组(r,18)
输入数组(i,20)=输入数组(i,20)+输入数组(r,20)
输入数组(r,21)=“删除”
DeletedRows=DeletedRows+1
退出
如果结束
接下来我
下一个r
'将新数组的大小调整为旧数组的大小减去删除的行数
ReDim输出数组(UBound(输入数组)-DeletedRows,20)
'将未标记为已删除的行从旧数组复制到新数组
i=1
对于r=1到uBond(输入_数组)
如果输入数组(r,21)“\u删除”,则
对于c=1到20
输出数组(i,c)=输入数组(r,c)
下一个c
i=i+1
如果结束
下一个r
'创建新工作表并
设置s=工作表。添加
设置NewRange=s.Range(“A2”)。调整大小(UBound(输出数组),20)
NewRange=输出数组

宏大约需要30秒才能将我的计算机上的20000行减少到3000行,考虑到数据量很大,我认为这非常好。

这可以在毫秒内完成,而不是几秒钟。我还使用数组和dictionary对象。然而,在更简单、更周到的实现中,这一点得到了体现。它比针对工作表使用sql更快。它可以比较任意数量的列,只需确保它们作为串联字符串包含在KeyIn变量中。我还简化了函数,假设我的sum值在第4列。您可以为其他列中的多个值调整代码。我从一个数组写到另一个数组(从InAy到OutAy),字典确定行是否已经存在。字典的Item属性中发生了神奇的变化。当写入新的OutAy行时,我将项目属性值指定给行(r)。然后,当它已经存在时,我检索它所在的行(r)
    sub somesub()
     ...
     data = Range("WhereYourDataIs") 'create data array
     Range("WhereYourDataIs").clear 'assumes you'll output to same location
     data = RemoveDupes(data) 'removedupes and sum values
     Range("A2").Resize(UBound(data), UBound(data, 2)) = data 'A2 assumes your data headers begin in row 1, column 1
     ...
    End Sub

Function RemoveDupes(InAy As Variant) As Variant
    Dim d As Scripting.Dictionary
    Set d = New Scripting.Dictionary
    ReDim OutAy(1 To UBound(InAy), 1 To 4)
    r = 1

    For i = 1 To UBound(InAy)
        KeyIn = ""
        KeyVal = InAy(i, 4) 'the value field to sum/aggregate if exists
        For c = 1 To 3 'a, b, c metadata to roll up
            KeyIn = KeyIn & InAy(i, c)
        Next c
        If d.Exists(KeyIn) Then
            OutAy(d.item(KeyIn), 4) = OutAy(d.item(KeyIn), 4) + KeyVal 'd.item(KeyIn) is r, set when OutAy row was added. Same as OutAy(r,4)=OutAy(r,4) + KeyVal 
            Else:
            d.Add KeyIn, r 'r is set as the item value referencing the row of the OutAy when it was first added. The reference is used when .Exists is true
            For c = 1 To 4
                OutAy(r, c) = InAy(i, c)
            Next c
            r = r + 1
        End If
    Next
    RemoveDupes = OutAy
End Function