Comparing two large text arrays in PowerShell


arrays powershell


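I have two arrays and I want to take the difference between them. I have had some success with Compare-Object, but it is too slow for larger arrays. In this example, $ALLVALUES and $ODD are my two arrays.

I used to be able to do this efficiently with FINDSTR, for example: FINDSTR /V /G:ODD.txt ALLVALUES.txt > EVEN.txt. FINDSTR finished 110,000 elements in under 2 seconds, even though it had to read from and write to disk.

I am trying to get back to FINDSTR performance, where it gives me everything in ALLVALUES.txt that does not match ODD.txt (in this case, the even values).

Note: this question is not about odd or even numbers. It is just a practical example that can be verified quickly and visually to be working as expected.

Here is the code I have been playing with. Using Compare-Object, 100,000 elements take 200 seconds, versus 2 seconds for FINDSTR on my computer. I think there must be a more elegant way to do this in PowerShell. Thanks for your help.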

# -------  Build the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ALLVALUES = while ($MIN -le $MAX) 
{
   "$PREFIX{0:D6}" -f $MIN++
}


# -------  Build the ODD values from the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ODD = while ($MIN -le $MAX) 
{
   If ($MIN%2) {
      "$PREFIX{0:D6}" -f $MIN++
   }
  ELSE {
    $MIN++
   }
}

Measure-Command{$EVEN = Compare-Object -DifferenceObject $ODD -ReferenceObject $ALLVALUES -PassThru}

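Arrays are objects, not just the simple blobs of text that findstr processes.
The fastest diff for string arrays is the .NET 3.5+ HashSet.SymmetricExceptWith method.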
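The snippet that accompanied this statement did not survive on this page. As a minimal sketch of the HashSet approach, modeled on the -unique branch of the Diff-Array function further down, it could look like the following. The small $a and $b arrays are stand-in data assumed for illustration; the question's $ALLVALUES and $ODD would be used the same way.

```powershell
# Stand-in data (an assumption for illustration, not the question's full arrays).
[string[]]$a = 'AA000001', 'AA000002', 'AA000003', 'AA000004'
[string[]]$b = 'AA000001', 'AA000003'

# Casting a string array to HashSet[string] copies its unique values into a set.
$diff = [Collections.Generic.HashSet[string]]$a

# SymmetricExceptWith modifies $diff in place, keeping only the values
# that occur in exactly one of the two inputs.
$diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)

# Convert back to a plain string array. A HashSet does not guarantee ordering,
# so sort the result if the order matters.
$diffArray = [string[]]$diff
```

Because the set operation runs in place on hashed values, the cost grows roughly linearly with the number of elements.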

With your data, that takes 46 ms for 100k elements on an i7 CPU.

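The HashSet approach omits duplicate values, so if those are needed in the output, I think we will have to use the much slower manual enumeration: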

function Diff-Array($a, $b, [switch]$unique) {
    if ($unique.IsPresent) {
        $diff = [Collections.Generic.HashSet[string]]$a
        $diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
        return [string[]]$diff
    }
    $occurrences = @{}
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = [Math]::Abs($_.value)
        while ($cnt--) { $_.key }
    }
}

Usage:

$diffArray = Diff-Array $ALLVALUES $ODD

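340 ms, which is 8 times slower than the HashSet but 110 times faster than Compare-Object.

Finally, we can make a faster Compare-Object for arrays of strings/numbers: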

function Compare-StringArray($a, $b, [switch]$unsorted) {
    $occurrences = if ($unsorted.IsPresent) { @{} }
                   else { [Collections.Generic.SortedDictionary[string,int]]::new() }
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = $_.value
        if ($cnt) {
            $diff = [PSCustomObject]@{
                InputObject = $_.key
                SideIndicator = if ($cnt -lt 0) { '=>' } else { '<=' }
            }
            $cnt = [Math]::Abs($cnt)
            while ($cnt--) {
                $diff
            }
        }
    }
}

The hash method you posted is very fast. Thanks for your answer! Are there any other methods to consider with reasonable speed?

I've added Compare-StringArray and improved the HashSet code.

Awesome solution, it processed 1.4M vs 600k strings in 30 seconds. In my case I used IntersectWith because I wanted the intersection.
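For readers who want the intersection mentioned in the last comment, or a one-sided difference closer to what FINDSTR /V /G: produced in the question, HashSet also offers IntersectWith and ExceptWith. The following is only a sketch with small stand-in arrays assumed for illustration. Like the HashSet diff in the answer, both operations drop duplicate values.

```powershell
# Stand-in data (an assumption for illustration).
[string[]]$a = 'AA000001', 'AA000002', 'AA000003', 'AA000004'
[string[]]$b = 'AA000001', 'AA000003', 'AA000005'

# Intersection: values present in both arrays.
$common = [Collections.Generic.HashSet[string]]$a
$common.IntersectWith($b)

# One-sided difference: values in $a that do not appear in $b.
$onlyInA = [Collections.Generic.HashSet[string]]$a
$onlyInA.ExceptWith($b)

# Convert back to arrays; sorting gives a stable order for display.
$commonArray  = [string[]]$common  | Sort-Object
$onlyInAArray = [string[]]$onlyInA | Sort-Object
```

Both methods modify the set in place, so each result starts from a fresh HashSet built from $a.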