Regex 跳过高性能Powershell正则表达式脚本块中的标题行
我从堆栈溢出得到了一些惊人的帮助。。。然而。。。太神奇了,我需要更多的帮助才能接近终点线。我每月2次解析多个巨大的4GB文件。我需要能够跳过标题,计算总行数、匹配行数和不匹配行数。我相信这对于一个超级明星来说是非常简单的,但在我的新手PS级别,我的技能还不强。也许你的一点帮助可以挽救这一周 数据样本:Regex 跳过高性能Powershell正则表达式脚本块中的标题行,regex,powershell,Regex,Powershell,我从堆栈溢出得到了一些惊人的帮助。。。然而。。。太神奇了,我需要更多的帮助才能接近终点线。我每月2次解析多个巨大的4GB文件。我需要能够跳过标题,计算总行数、匹配行数和不匹配行数。我相信这对于一个超级明星来说是非常简单的,但在我的新手PS级别,我的技能还不强。也许你的一点帮助可以挽救这一周 数据样本: ID FIRST_NAME LAST_NAME COLUMN_NM_TOO_LON5THCOLUMN 10000000001MINNI
ID FIRST_NAME LAST_NAME COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE MOUSE COLUMN VALUE LONGSTARTS
10000000002MICKLE ROONEY MOUSE COLUMN VALUE LONGSTARTS
代码块(基于):
您只需要跟踪两个计数-匹配的和不匹配的行-然后使用布尔值指示是否跳过了第一行
$first = $false
$matched = 0
$unmatched = 0
. {
switch -File $infile -Regex {
$match_regex {
if($first){
# Join what all the capture groups matched with a tab char.
$Matches[1..($Matches.Count-1)].Trim() -join "`t"
$matched++
}
$first = $true
}
default{
$unmatched++
# you can remove this, if the pattern always matches the header
$first = $true
}
}
} | Out-File $outFile
$total = $matched + $unmatched
使用System.IO.StreamReader将处理时间减少到原来的20%左右。这是我绝对需要的要求 我在不牺牲性能的情况下添加了逻辑和计数器。字段计数器和逐行比较在查找不良记录时特别有用 这是实际代码的复制/粘贴,但我缩短了一些内容,制作了一些稍微伪代码的内容,因此您可能需要使用它来让事情为您自己工作
Function Get-Regx-Data-Format() {
Param ([String] $filename)
if ($filename -eq 'FILE NAME') {
[regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
}
return $match_regex
}
Foreach ($file in $cutoff_files) {
$starttime_for_file = (Get-Date)
$source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
$source_path = $source_dir + $source_file
$parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
$parse_file_path = $parse_target_dir + $parse_file
$error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
$error_file_path = $error_target_dir + $error_file
[regex]$match_data_regex = Get-Regx-Data-Format $file
Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue
[long]$matched_cnt = 0
[long]$unmatched_cnt = 0
[long]$loop_counter = 0
[boolean]$has_header_row=$true
[int]$field_cnt=0
[int]$previous_field_cnt=0
[int]$array_length=0
$parse_minutes = Measure-Command {
try {
$stream_log = [System.IO.StreamReader]::new($source_path)
$stream_in = [System.IO.StreamReader]::new($source_path)
$stream_out = [System.IO.StreamWriter]::new($parse_file_path)
$stream_err = [System.IO.StreamWriter]::new($error_file_path)
while ($line = $stream_in.ReadLine()) {
if ($line -match $match_data_regex) {
#if matched and it's the header, parse and write to the beg of output file
if (($loop_counter -eq 0) -and $has_header_row) {
$stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))
} else {
$previous_field_cnt = $field_cnt
#add year month to line start, trim and join every captured field w/tabs
$stream_out.WriteLine("$proc_yyyymm`t" + `
($Matches[1..($array_length)].Trim() -join "`t"))
$matched_cnt++
$field_cnt=$Matches.Count
if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
write-host "`nError on line $($loop_counter + 1). `
The field count does not match the previous correctly `
formatted (non-error) row."
}
}
} else {
if (($loop_counter -eq 0) -and $has_header_row) {
#if the header, write to the beginning of the output file
$stream_out.WriteLine($line)
} else {
$stream_err.WriteLine($line)
$unmatched_cnt++
}
}
$loop_counter++
}
} finally {
$stream_in.Dispose()
$stream_out.Dispose()
$stream_err.Dispose()
$stream_log.Dispose()
}
} | Select-Object -Property TotalMinutes
write-host "`n$file_list_idx. File $file parsing results....`nMatched Count =
$matched_cnt UnMatched Count = $unmatched_cnt Parse Minutes = $parse_minutes`n"
$file_list_idx++
$endtime_for_file = (Get-Date)
write-host "`nEnded processing file at $endtime_for_file"
$TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
$Hrs_for_file = $TimeDiff_for_file.Hours
$Mins_for_file = $TimeDiff_for_file.Minutes
$Secs_for_file = $TimeDiff_for_file.Seconds
write-host "`nElapsed Time for file $file processing:
$Hrs_for_file`:$Mins_for_file`:$Secs_for_file"
}
$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"
是的,我确定我需要初始化一些行计数器。。。ttl_行、ttl_匹配、ttl_不匹配,然后具有适当的++和else块;然而,我仍然在语法上受到挑战,我希望有人能展示出这些优点在这个从你自己(stackoverflow)的建议修改而来的超级流畅的代码块中的适用性。然后,当然,当时间允许的时候,我会研究各种可能的优化方法!(调用C++,NET我们会看到…)到目前为止,我对这个版本非常满意…只是在内务管理上有点绊倒。解释一下这个正则表达式应该做什么。另外
不匹配行=总行-匹配行x15-大正则表达式将一个固定列长度的文本文件切分,然后修剪捕获组并用选项卡连接。我想,有这个答案的人几乎会立即识别编码模式,并知道插入内容的位置,但是,如果我说得太含糊,很抱歉。@Mark:$\ucode>在switch
语句的脚本块中包含了手边的输入行。我将变量完全从点脚本块中取出,放在设置regex变量之前,现在我正在测试这个大文件。如果有效,你用@mklement0软膏的答案就是答案。下一步,我将研究将不匹配的行输出到一个单独的文件是超级简单,还是另一次到无穷远的旅行!现在一切正常,甚至将不匹配的行输出到错误文件。是 啊1.8GB文件+100列的解析时间约为7.8分钟。我很高兴!有一件事,你能解释一下为什么$false开头是$false吗?这是用于某种类型的优化吗?因为我们从上到下读取文件,所以$first不会以$true开头。我很清楚,所以我只是想知道那里的想法。另外,默认值将匹配任何错误,那么我们是否只测试true,然后如果为true,则设置false(这意味着头恰好与任何内容都不匹配)?无论如何,我不知道Stackoverflow上的工作原理,所以我很好奇。@请注意,在进入开关之前,您还没有完成第一行的处理?:)好的,我知道你在那里做了什么$第一个表示第一行已经通过。我的想法正好相反。谢谢你的澄清。
Function Get-Regx-Data-Format() {
Param ([String] $filename)
if ($filename -eq 'FILE NAME') {
[regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
}
return $match_regex
}
Foreach ($file in $cutoff_files) {
$starttime_for_file = (Get-Date)
$source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
$source_path = $source_dir + $source_file
$parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
$parse_file_path = $parse_target_dir + $parse_file
$error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
$error_file_path = $error_target_dir + $error_file
[regex]$match_data_regex = Get-Regx-Data-Format $file
Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue
[long]$matched_cnt = 0
[long]$unmatched_cnt = 0
[long]$loop_counter = 0
[boolean]$has_header_row=$true
[int]$field_cnt=0
[int]$previous_field_cnt=0
[int]$array_length=0
$parse_minutes = Measure-Command {
try {
$stream_log = [System.IO.StreamReader]::new($source_path)
$stream_in = [System.IO.StreamReader]::new($source_path)
$stream_out = [System.IO.StreamWriter]::new($parse_file_path)
$stream_err = [System.IO.StreamWriter]::new($error_file_path)
while ($line = $stream_in.ReadLine()) {
if ($line -match $match_data_regex) {
#if matched and it's the header, parse and write to the beg of output file
if (($loop_counter -eq 0) -and $has_header_row) {
$stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))
} else {
$previous_field_cnt = $field_cnt
#add year month to line start, trim and join every captured field w/tabs
$stream_out.WriteLine("$proc_yyyymm`t" + `
($Matches[1..($array_length)].Trim() -join "`t"))
$matched_cnt++
$field_cnt=$Matches.Count
if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
write-host "`nError on line $($loop_counter + 1). `
The field count does not match the previous correctly `
formatted (non-error) row."
}
}
} else {
if (($loop_counter -eq 0) -and $has_header_row) {
#if the header, write to the beginning of the output file
$stream_out.WriteLine($line)
} else {
$stream_err.WriteLine($line)
$unmatched_cnt++
}
}
$loop_counter++
}
} finally {
$stream_in.Dispose()
$stream_out.Dispose()
$stream_err.Dispose()
$stream_log.Dispose()
}
} | Select-Object -Property TotalMinutes
write-host "`n$file_list_idx. File $file parsing results....`nMatched Count =
$matched_cnt UnMatched Count = $unmatched_cnt Parse Minutes = $parse_minutes`n"
$file_list_idx++
$endtime_for_file = (Get-Date)
write-host "`nEnded processing file at $endtime_for_file"
$TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
$Hrs_for_file = $TimeDiff_for_file.Hours
$Mins_for_file = $TimeDiff_for_file.Minutes
$Secs_for_file = $TimeDiff_for_file.Seconds
write-host "`nElapsed Time for file $file processing:
$Hrs_for_file`:$Mins_for_file`:$Secs_for_file"
}
$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"