Regex: Skipping the header row in a high-performance PowerShell regex script block

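I have gotten some amazing help from Stack Overflow. Amazing as it was, I need a little more help to get to the finish line. Twice a month I parse multiple huge 4GB files. I need to be able to skip the header row and count the total rows, the matched rows, and the unmatched rows. I am sure this is trivial for a PowerShell superstar, but at my newbie PS level my skills are not strong yet. A little help from you could save my week.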

Data sample:

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
 10000000001MINNIE                 MOUSE              COLUMN VALUE LONGSTARTS 
 10000000002MICKLE ROONEY          MOUSE              COLUMN VALUE LONGSTARTS 
Code block (based on):


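You only need to track two counts, the matched rows and the unmatched rows, and then use a Boolean to indicate whether the first line has already been passed.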

$first = $false
$matched = 0
$unmatched = 0
. {
    switch -File $infile -Regex  {
        $match_regex {
            if($first){
                # Join what all the capture groups matched with a tab char.
                $Matches[1..($Matches.Count-1)].Trim() -join "`t"
                $matched++
            }
            $first = $true
        }
        default{
            $unmatched++
            # you can remove this, if the pattern always matches the header
            $first = $true
        }
    }
} | Out-File $outFile

$total = $matched + $unmatched
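
To see the counters in action, here is a small self-contained sketch of the same pattern run against a throwaway file. The three-column layout, the sample rows, and the temp files are assumptions made up for illustration; they are not the real data format from the question.

```powershell
# Hypothetical fixed-width layout: 3-char ID, 6-char first name, 5-char last name.
$match_regex = '^(.{3})(.{6})(.{5})$'

# Throwaway input and output files (illustration only).
$infile  = [System.IO.Path]::GetTempFileName()
$outFile = [System.IO.Path]::GetTempFileName()

# One header row, two well-formed rows, one malformed row.
@(
    'ID FIRST LAST '
    '001MINNIEMOUSE'
    'bad'
    '002MICKEYMOUSE'
) | Set-Content -LiteralPath $infile

$first     = $false   # becomes $true once the first line has been passed
$matched   = 0
$unmatched = 0

. {
    switch -File $infile -Regex {
        $match_regex {
            if ($first) {
                # Trim every capture group and join them with tabs.
                $Matches[1..($Matches.Count - 1)].Trim() -join "`t"
                $matched++
            }
            $first = $true
        }
        default {
            $unmatched++
            $first = $true
        }
    }
} | Set-Content -LiteralPath $outFile

# The header matched the pattern but was skipped, so it is not part of the total.
$total = $matched + $unmatched
"total=$total matched=$matched unmatched=$unmatched"
```

With this sample input the header is skipped, the two well-formed rows are written to the output file as tab-separated fields, and the short row is counted as unmatched.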

Using System.IO.StreamReader cut the processing time to roughly 20% of what it was before. That was an absolute requirement for me.

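I added logic and counters without sacrificing performance. The field counter and the row-by-row comparison are especially useful for finding bad records.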

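This is a copy/paste of the actual code, but I shortened a few things and turned some of it into slightly pseudo-code, so you may need to play with it to get things working for yourself.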

Function Get-Regx-Data-Format() {
    Param ([String] $filename)

    if ($filename -eq 'FILE NAME') {
        [regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
    }
    return $match_regex
}

Foreach ($file in $cutoff_files) {

  $starttime_for_file = (Get-Date)
  $source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
  $source_path = $source_dir + $source_file

  $parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
  $parse_file_path = $parse_target_dir + $parse_file

  $error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
  $error_file_path = $error_target_dir + $error_file

  [regex]$match_data_regex = Get-Regx-Data-Format $file

  Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
  Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue

  [long]$matched_cnt = 0
  [long]$unmatched_cnt = 0
  [long]$loop_counter = 0
  [boolean]$has_header_row=$true
  [int]$field_cnt=0
  [int]$previous_field_cnt=0
  [int]$array_length=0

  $parse_minutes = Measure-Command {
    try {
        $stream_log = [System.IO.StreamReader]::new($source_path)
        $stream_in = [System.IO.StreamReader]::new($source_path)
        $stream_out = [System.IO.StreamWriter]::new($parse_file_path)
        $stream_err = [System.IO.StreamWriter]::new($error_file_path)

        # Compare against $null: a bare assignment would end the loop at the first empty line
        while ($null -ne ($line = $stream_in.ReadLine())) {

          if ($line -match $match_data_regex) {

              #if matched and it's the header, parse and write to the beg of output file
              if (($loop_counter -eq 0) -and $has_header_row) {
                  # $Matches[0] is the whole line, so the captured fields are 1..($Matches.Count - 1)
                  $stream_out.WriteLine(($Matches[1..($Matches.Count - 1)].Trim() -join "`t"))

              } else {
                  $previous_field_cnt = $field_cnt

                  #add year month to line start, trim and join every captured field w/tabs
                  $stream_out.WriteLine("$proc_yyyymm`t" + `
                         ($Matches[1..($Matches.Count - 1)].Trim() -join "`t"))

                  $matched_cnt++
                  $field_cnt=$Matches.Count

                  if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
                    write-host "`nError on line $($loop_counter + 1). The field count does not match the previous correctly formatted (non-error) row."
                  }

              }
          } else {
              if (($loop_counter -eq 0) -and $has_header_row) {
                #if the header, write to the beginning of the output file
                  $stream_out.WriteLine($line)
              } else {
                $stream_err.WriteLine($line)
                $unmatched_cnt++
              }
          }
          $loop_counter++
       }
    } finally {
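        # Guard each Dispose() so a failed constructor does not raise a second error here
        if ($stream_in)  { $stream_in.Dispose() }
        if ($stream_out) { $stream_out.Dispose() }
        if ($stream_err) { $stream_err.Dispose() }
        if ($stream_log) { $stream_log.Dispose() }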
    }
  } | Select-Object -Property TotalMinutes

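  # $parse_minutes is an object with a TotalMinutes property, so read the property explicitly
  write-host "`n$file_list_idx. File $file parsing results....`nMatched Count = $matched_cnt  UnMatched Count = $unmatched_cnt  Parse Minutes = $($parse_minutes.TotalMinutes)`n"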

  $file_list_idx++

  $endtime_for_file = (Get-Date)
  write-host "`nEnded processing file at $endtime_for_file"

  $TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
  $Hrs_for_file = $TimeDiff_for_file.Hours
  $Mins_for_file = $TimeDiff_for_file.Minutes
  $Secs_for_file = $TimeDiff_for_file.Seconds 
  write-host "`nElapsed Time for file $file processing: $Hrs_for_file`:$Mins_for_file`:$Secs_for_file"

}

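# $starttime is assumed to have been captured at the top of the script (not shown here).
# Keep $endtime as a DateTime; a formatted "HH:mm:ss" string loses the date part.
$endtime = Get-Date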
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds 
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"
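
For readers who only want the core of the approach, the reader loop above can be condensed into a minimal sketch like the following. The regex, sample rows, temp files, and variable names are all hypothetical and chosen only so the snippet runs on its own; it assumes PowerShell 5 or later for the `::new()` syntax.

```powershell
# Hypothetical fixed-width layout: 3-char ID, 6-char first name, 5-char last name.
[regex]$rx = '^(.{3})(.{6})(.{5})$'

# Throwaway files (illustration only).
$src = [System.IO.Path]::GetTempFileName()
$out = [System.IO.Path]::GetTempFileName()
$err = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($src, [string[]]@(
    'ID FIRST LAST ', '001MINNIEMOUSE', '', 'bad', '002MICKEYMOUSE'))

[long]$total = 0
[long]$matched = 0
[long]$unmatched = 0
$isHeader = $true
$reader = $writer = $errWriter = $null

try {
    $reader    = [System.IO.StreamReader]::new($src)
    $writer    = [System.IO.StreamWriter]::new($out)
    $errWriter = [System.IO.StreamWriter]::new($err)

    # Compare against $null so an empty line does not end the loop early.
    while ($null -ne ($line = $reader.ReadLine())) {
        if ($isHeader) {
            # Skip the header row without counting it.
            $isHeader = $false
            continue
        }
        $total++
        $m = $rx.Match($line)
        if ($m.Success) {
            # Group 0 is the whole line; groups 1..n are the captured fields.
            $fields = for ($i = 1; $i -lt $m.Groups.Count; $i++) {
                $m.Groups[$i].Value.Trim()
            }
            $writer.WriteLine($fields -join "`t")
            $matched++
        } else {
            $errWriter.WriteLine($line)
            $unmatched++
        }
    }
} finally {
    # Dispose only the streams that were actually created.
    foreach ($s in $reader, $writer, $errWriter) {
        if ($s) { $s.Dispose() }
    }
}

"total=$total matched=$matched unmatched=$unmatched"
```

Note that the empty line in the sample input is treated as an unmatched data row instead of ending the loop, which is why the loop condition compares against `$null`.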

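Yes, I am sure I need to initialize some row counters (ttl_rows, ttl_matched, ttl_unmatched) and then have the appropriate ++ and else blocks. However, I am still syntax-challenged, and I am hoping someone can show where these pieces fit into this super slick code block, which was modified from your own (Stack Overflow) suggestions. Then, of course, when time allows, I will study the various possible optimizations (calling C++, .NET, we'll see). So far I am very happy with this version; I am just stumbling a bit on the housekeeping.

Explain what this regex is supposed to do. Also, unmatched rows = total rows - matched rows.

The big regex slices up a text file with fixed column lengths, then trims the capture groups and joins them with tabs. I figured that whoever has the answer would recognize the coding pattern almost immediately and know where to insert things, but I am sorry if I was too vague.

@Mark: $_ inside the switch statement's script block contains the input line at hand.

I took the variables completely out of the dot-sourced script block and put them before the line that sets the regex variable, and now I am testing against the big file. If it works, your answer, with @mklement0's addition, is the answer. Next I will look into whether writing the unmatched rows to a separate file is super simple or another trip to infinity!

Everything works now, even writing the unmatched rows to an error file. Yeah! The parse time for a 1.8GB file with 100+ columns is about 7.8 minutes. I am very happy! One thing: can you explain why $first starts out as $false? Is that some kind of optimization? Since we read the file from top to bottom, I would have expected $first to start as $true. Also, the default block catches anything that does not match, so are we only testing for true and then setting false if it is true (meaning the header happens to match nothing)? Anyway, I do not know how this works on Stack Overflow, so I am curious.

Note that before entering the switch, you have not finished processing the first line yet. :)

OK, I see what you did there. $first indicates that the first line has already been passed. I was thinking of it the opposite way. Thanks for the clarification.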
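
The comments mention writing unmatched rows to a separate error file and using $_ inside the switch block. One way this might look is sketched below; it is a variation on the switch-based answer, not the poster's actual code. The layout, sample rows, and temp files are hypothetical, and unlike the original answer this variant skips the first line even when it does not match the pattern.

```powershell
# Hypothetical fixed-width layout: 3-char ID, 6-char first name, 5-char last name.
$match_regex = '^(.{3})(.{6})(.{5})$'

# Throwaway files (illustration only).
$infile  = [System.IO.Path]::GetTempFileName()
$outFile = [System.IO.Path]::GetTempFileName()
$errFile = [System.IO.Path]::GetTempFileName()

# The header here deliberately does NOT match the pattern.
@(
    'id,first,last'
    '001MINNIEMOUSE'
    'bad'
    '002MICKEYMOUSE'
) | Set-Content -LiteralPath $infile

$first     = $false   # $true once the first line has been passed
$matched   = 0
$unmatched = 0
$errWriter = [System.IO.StreamWriter]::new($errFile)

try {
    . {
        switch -File $infile -Regex {
            $match_regex {
                if ($first) {
                    $Matches[1..($Matches.Count - 1)].Trim() -join "`t"
                    $matched++
                }
                $first = $true
            }
            default {
                if ($first) {
                    # $_ holds the input line currently being processed.
                    $errWriter.WriteLine($_)
                    $unmatched++
                }
                $first = $true
            }
        }
    } | Set-Content -LiteralPath $outFile
} finally {
    $errWriter.Dispose()
}

"matched=$matched unmatched=$unmatched total=$($matched + $unmatched)"
```

Because $first starts as $false and is only set to $true after the first line has been handled, both branches can use the same test to decide whether a line is data or the header.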