Efficiently merge large object datasets having multiple matching keys

performance powershell loops

In a PowerShell script, I have two datasets with multiple columns. Not all of these columns are shared.

For example, dataset 1:

A B    XY   ZY  
- -    --   --  
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6

And dataset 2:

A B    ABC  GH  
- -    ---  --  
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8

I want to merge the two datasets, specifying which columns act as keys (A and B in my simple example). The expected result is:

A B    XY   ZY   ABC  GH  
- -    --   --   ---  --  
1 val1 foo1 bar1          
2 val2 foo2 bar2          
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7           foo7 bar7
8 val8           foo8 bar8

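The concept is very similar to a SQL join query (strictly speaking a full outer join rather than a cross join, since unmatched rows from both sides are kept).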

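I have already been able to write a function that merges the objects successfully. Unfortunately, the computation time grows far faster than the dataset size (roughly quadratically, judging by the timings below, rather than truly exponentially).

If I generate the datasets using: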

$dsLength = 10
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

I get the following results:

- $dsLength = 10 => 33 ms (fine)
- $dsLength = 100 => 89 ms (fine)
- $dsLength = 1000 => 1563 ms (acceptable)
- $dsLength = 5000 => 35764 ms (too much)
- $dsLength = 10000 => 138047 ms (too much)
- $dsLength = 20000 => 573614 ms (too much)

How can I merge the datasets efficiently when they are large (my target is around 20K items)?

Right now, I have the following function defined:

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    foreach($row1 in $Dataset1){
        $result += $row1
        $ds2propsNotInDs1Props | % {
            $row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
    }

    foreach($row2 in $Dataset2){
        $existing = foreach($candidate in $result){
            $match = $true
            foreach($prop in $Properties){
                if(-not ($row2.$prop -eq $candidate.$prop)){
                    $match = $false                   
                    break                  
                }
            }
            if($match){
                $candidate
                break
            }
        }
        if(!$existing){
            $ds1propsNotInDs2Props | % {
                $row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $result += $row2
        }else{
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }

        }
    }

    $result
}

I call the function like this:

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

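My feeling is that the slowness comes from the second loop, where I try to match an existing row on every iteration.

[Edit] A second approach, using a hash as an index. Surprisingly, it is even slower than the first attempt.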

$dsLength = 1000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties    
    )

    $InputObject | Select-object $properties | Out-String
}


function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()
    $index = @{}

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $allProps = $ds1props + $ds2props | select -Unique

    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    $ds1index = @{}

    foreach($row1 in $Dataset1){
        $tempObject = new-object psobject
        $result += $tempObject
        $ds2propsNotInDs1Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        $ds1props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
        }

        $hash1 = Get-Hash -InputObject $row1 -Properties $Properties
        $ds1index.Add($hash1, $tempObject)

    }

    foreach($row2 in $Dataset2){
        $hash2 = Get-Hash -InputObject $row2 -Properties $Properties

        if($ds1index.ContainsKey($hash2)){
            # merge object
            $existing = $ds1index[$hash2]
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
            $ds1index.Remove($hash2)

        }else{
            # add object
            $tempObject = new-object psobject
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $ds2props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
            }
            $result += $tempObject
        }
    }

    $result
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

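[Edit2] Putting Measure-Command around the two loops shows that the first loop is still slow. In fact, the first loop takes more than 50% of the total time.

--
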
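I agree with @Matt. Use a hashtable, along the lines described below. This should run in m + 2n rather than mn time.

Timings on my system

For the original solution above, the timings look decidedly O(n^2). For the hashtable-based solution, they look linear.

Solution

I used three techniques to increase the speed:

1. Switch to a hashtable. This allows constant-time lookups, so you no longer need nested loops. This is the only change really needed to go from O(n^2) to linear time. The drawback is that it requires more setup work, so the advantage of linear time does not show until the loop count is large enough to pay for the setup.

2. Use an ArrayList instead of a native array. Adding an item to a native array requires reallocating the array and copying all the items, so this is also an O(n^2) operation. Because it is done at the engine level, the constant is very small, so it does not really matter until much later.

3. Use PsObject.Copy to create the new objects. This is a small optimization compared to the other two, but it still cut the run time in half.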
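
To illustrate how the three techniques above can fit together, here is a minimal sketch. It is not the code from this answer: the function name `Merge-ObjectsByKey` and the `|` key separator are assumptions made for the example. It also assumes that key values never contain `|`, that keys are unique within each dataset, and that the first row of each dataset carries all of its columns.

```powershell
# Hypothetical function name; a sketch of a hashtable-based merge, not the original answer's code.
function Merge-ObjectsByKey {
    param(
        [Parameter(Mandatory = $true)][object[]]$Dataset1,
        [Parameter(Mandatory = $true)][object[]]$Dataset2,
        [Parameter(Mandatory = $true)][string[]]$Properties
    )

    # Columns that exist on one side only, read from the first row of each dataset.
    $names1 = @($Dataset1[0].PSObject.Properties | ForEach-Object { $_.Name })
    $names2 = @($Dataset2[0].PSObject.Properties | ForEach-Object { $_.Name })
    $only1  = @($names1 | Where-Object { $names2 -notcontains $_ })
    $only2  = @($names2 | Where-Object { $names1 -notcontains $_ })

    $result = New-Object System.Collections.ArrayList   # appends without re-copying the array
    $index  = @{}                                        # key string -> merged row

    foreach ($row1 in $Dataset1) {
        $merged = $row1.PSObject.Copy()                  # shallow copy; the input row stays untouched
        foreach ($name in $only2) {
            Add-Member -InputObject $merged -MemberType NoteProperty -Name $name -Value $null
        }
        # Assumed separator '|': only safe if key values never contain it.
        $key = @(foreach ($p in $Properties) { [string]$row1.$p }) -join '|'
        $index[$key] = $merged                           # on duplicate keys the last row wins
        [void]$result.Add($merged)
    }

    foreach ($row2 in $Dataset2) {
        $key = @(foreach ($p in $Properties) { [string]$row2.$p }) -join '|'
        if ($index.ContainsKey($key)) {
            # Constant-time lookup replaces the inner loop over all candidates.
            $existing = $index[$key]
            foreach ($name in $only2) { $existing.$name = $row2.$name }
        }
        else {
            $merged = $row2.PSObject.Copy()
            foreach ($name in $only1) {
                Add-Member -InputObject $merged -MemberType NoteProperty -Name $name -Value $null
            }
            [void]$result.Add($merged)
        }
    }

    $result
}
```

It can be called the same way as the function in the question, for example `Merge-ObjectsByKey -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties 'A','B'`.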
--

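I have had a lot of doubts about incorporating a hash table into my cmdlet (Join-Object in the table below), because there are a few issues to overcome that are conveniently left out of the example in the question.
Unfortunately, I can't compete with the performance of @mhhollomon's solution:

dsLength Steve1 Steve2 mhhollomon Join-Object
-------- ------ ------ ---------- -----------
      10     19    129         21          50
     100    145    915        158         329
    1000   2936   9646       1575        3355
    5000  56129  69558       5814       12653
   10000 183813  95472      14740       25730
   20000 761450 265061      36822       80644
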
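But I think I can add some value:

Not right

A hash key built from several properties is a single string, which means you need to cast the related properties to strings. That is a little questionable, because:

$Left -eq $Right ≠ "$Left" -eq "$Right"

In most cases it will work, especially when the source is a .csv file, but it can go wrong, for example when the data comes from a cmdlet where $Null really means something other than an empty string (''). I therefore recommend defining $Null keys explicitly.
Because property values can easily contain a colon (:), I also recommend using a control character to separate (join) multiple keys.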
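
A minimal sketch of such a key builder follows. The function name `Get-JoinKey`, the unit separator (0x1F) used between parts, and the NUL character (0x00) standing in for `$null` are all assumptions chosen for illustration, not part of Join-Object. The keys are compared ordinally here because, as far as I know, culture-aware string comparison can treat some control characters as ignorable; please verify that in your own environment before relying on `-eq` or a default `@{}` hashtable with keys like these.

```powershell
# Hypothetical helper: builds one string key from several properties.
function Get-JoinKey {
    param(
        [Parameter(Mandatory = $true)][object]$InputObject,
        [Parameter(Mandatory = $true)][string[]]$Properties
    )

    $separator  = [string][char]0x1F   # assumed: ASCII unit separator, unlikely to occur in data
    $nullMarker = [string][char]0x00   # assumed: stand-in so that $null differs from ''

    $parts = foreach ($name in $Properties) {
        $value = $InputObject.$name
        if ($null -eq $value) { $nullMarker } else { [string]$value }
    }
    @($parts) -join $separator
}

$withNull  = [pscustomobject]@{ A = $null; B = 'val1' }
$withEmpty = [pscustomobject]@{ A = '';    B = 'val1' }

$keyNull  = Get-JoinKey -InputObject $withNull  -Properties 'A', 'B'
$keyEmpty = Get-JoinKey -InputObject $withEmpty -Properties 'A', 'B'

# A plain [string] cast would give both rows the same key; the marker keeps them apart.
[string]::Equals($keyNull, $keyEmpty, [System.StringComparison]::Ordinal)   # False
```

If keys like these go into a lookup table, a dictionary created with an ordinal comparer (for example `[System.StringComparer]::OrdinalIgnoreCase`) is one way to keep the comparison consistent with the check above.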

Also right

There is another pitfall in using a hash table, which actually doesn't have to be a problem: what if the left side ($dataset1) and/or the right side ($dataset2) contains multiple matches? Take the following datasets as an example:

$dataset1 =

A B    XY    ZY
- -    --    --
1 val1 foo1  bar1
2 val2 foo2  bar2
3 val3 foo3  bar3
4 val4 foo4  bar4
4 val4 foo4a bar4a
5 val5 foo5  bar5
6 val6 foo6  bar6

$dataset2 =

A B    ABC   GH
- -    ---   --
3 val3 foo3  bar3
4 val4 foo4  bar4
5 val5 foo5  bar5
5 val5 foo5a bar5a
6 val6 foo6  bar6
7 val7 foo7  bar7
8 val8 foo8  bar8

In this case, I would expect a result similar to that of a SQL join, and not an "Item has already been added. Key in dictionary" error:

$Dataset1 | FullJoin $dataset2 -On A, B | Format-Table

A B    XY    ZY    ABC   GH
- -    --    --    ---   --
1 val1 foo1  bar1
2 val2 foo2  bar2
3 val3 foo3  bar3  foo3  bar3
4 val4 foo4  bar4  foo4  bar4
4 val4 foo4a bar4a foo4  bar4
5 val5 foo5  bar5  foo5  bar5
5 val5 foo5  bar5  foo5a bar5a
6 val6 foo6  bar6  foo6  bar6
7 val7             foo7  bar7
8 val8             foo8  bar8
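
One way to get that behaviour from a lookup table is to let every key map to a list of rows instead of a single row. The sketch below is only an illustration of that idea and is not how Join-Object is implemented; the names `Get-RowKey`, `New-JoinedRow` and `Join-FullByKey` are hypothetical, and it assumes PowerShell 5 or later (for `::new`) and that the first row of each side carries all of its columns.

```powershell
# Hypothetical helper: one string key per row, parts joined by a unit separator.
function Get-RowKey {
    param([object]$Row, [string[]]$On)
    @(foreach ($name in $On) { [string]$Row.$name }) -join [char]0x1F
}

# Hypothetical helper: combine a left row and/or a right row into one output row.
function New-JoinedRow {
    param([object]$LeftRow, [object]$RightRow, [string[]]$Columns)
    $out = [ordered]@{}
    foreach ($name in $Columns) {
        # Prefer the left value when the left row has that column, else use the right row.
        $source = if ($null -ne $LeftRow -and $LeftRow.PSObject.Properties[$name]) { $LeftRow } else { $RightRow }
        $out[$name] = if ($null -ne $source) { $source.$name } else { $null }
    }
    [pscustomobject]$out
}

function Join-FullByKey {
    param(
        [Parameter(Mandatory = $true)][object[]]$Left,
        [Parameter(Mandatory = $true)][object[]]$Right,
        [Parameter(Mandatory = $true)][string[]]$On
    )

    # Output columns: left columns first, then the right-only ones.
    $columns  = @($Left[0].PSObject.Properties | ForEach-Object { $_.Name })
    $columns += @($Right[0].PSObject.Properties | ForEach-Object { $_.Name } |
            Where-Object { $columns -notcontains $_ })

    # Index the right side: each key maps to a list, so duplicate keys never collide.
    $comparer = [System.StringComparer]::OrdinalIgnoreCase
    $index    = [System.Collections.Generic.Dictionary[string,System.Collections.ArrayList]]::new($comparer)
    $matched  = [System.Collections.Generic.HashSet[string]]::new($comparer)
    foreach ($row in $Right) {
        $key = Get-RowKey -Row $row -On $On
        if (-not $index.ContainsKey($key)) { $index[$key] = New-Object System.Collections.ArrayList }
        [void]$index[$key].Add($row)
    }

    foreach ($leftRow in $Left) {
        $key = Get-RowKey -Row $leftRow -On $On
        if ($index.ContainsKey($key)) {
            [void]$matched.Add($key)
            # One output row per matching right row, as a SQL join would produce.
            foreach ($rightRow in $index[$key]) {
                New-JoinedRow -LeftRow $leftRow -RightRow $rightRow -Columns $columns
            }
        }
        else {
            New-JoinedRow -LeftRow $leftRow -RightRow $null -Columns $columns
        }
    }

    # Right rows whose key never appeared on the left.
    foreach ($row in $Right) {
        if (-not $matched.Contains((Get-RowKey -Row $row -On $On))) {
            New-JoinedRow -LeftRow $null -RightRow $row -Columns $columns
        }
    }
}
```

With the two datasets above loaded as objects, `Join-FullByKey -Left $dataset1 -Right $dataset2 -On A, B` should produce the ten rows of the table, because a duplicate key simply becomes another entry in the list for that key.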
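
Only right

As you may have figured out, there is no reason to put both sides in a hash table, but you might consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a realistic use case. It is more common for your data to come from somewhere else.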
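
As a rough sketch of what streaming the left side could look like, the function below indexes only the right side up front and then handles each left object as it arrives from the pipeline. The name `Join-Streaming` and its parameters are hypothetical, it is not the Join-Object implementation, and it assumes PowerShell 5 or later. To keep it short, unmatched rows are passed through without the other side's empty columns.

```powershell
# Hypothetical function: the right side is indexed once, the left side is streamed.
function Join-Streaming {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)][object]$LeftObject,
        [Parameter(Mandatory = $true)][object[]]$Right,
        [Parameter(Mandatory = $true)][string[]]$On
    )
    begin {
        $comparer = [System.StringComparer]::OrdinalIgnoreCase
        $index = [System.Collections.Generic.Dictionary[string,System.Collections.ArrayList]]::new($comparer)
        $seen  = [System.Collections.Generic.HashSet[string]]::new($comparer)
        # Key builder: the key parts are joined by a unit separator (assumed choice).
        $getKey = { param($item) @(foreach ($n in $On) { [string]$item.$n }) -join [char]0x1F }

        foreach ($row in $Right) {
            $key = & $getKey $row
            if (-not $index.ContainsKey($key)) { $index[$key] = New-Object System.Collections.ArrayList }
            [void]$index[$key].Add($row)
        }
    }
    process {
        # Runs once per incoming left object, before the next one arrives.
        $key = & $getKey $LeftObject
        [void]$seen.Add($key)
        if ($index.ContainsKey($key)) {
            foreach ($rightRow in $index[$key]) {
                $out = $LeftObject.PSObject.Copy()
                foreach ($p in $rightRow.PSObject.Properties) {
                    if (-not $out.PSObject.Properties[$p.Name]) {
                        Add-Member -InputObject $out -MemberType NoteProperty -Name $p.Name -Value $p.Value
                    }
                }
                $out   # emitted immediately, so a downstream cmdlet can start working
            }
        }
        else {
            $LeftObject.PSObject.Copy()
        }
    }
    end {
        # Only now is it known which right rows never found a partner.
        foreach ($row in $Right) {
            if (-not $seen.Contains((& $getKey $row))) { $row }
        }
    }
}

# Usage: the left side comes from the pipeline, the right side is passed as an argument.
$rightSide = 2..4 | ForEach-Object { [pscustomobject]@{ A = $_; Y = "right$_" } }
$joined = 1..3 | ForEach-Object { [pscustomobject]@{ A = $_; X = "left$_" } } |
    Join-Streaming -Right $rightSide -On A
```

Because matched rows leave the `process` block as soon as they are built, timing such a function in isolation says little about how the whole pipeline behaves.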