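PowerShell: speed problem with match operations on large text files

I have a data set of 36 .log files that I need to preprocess so that I can load them into a large data frame for data visualization inside a Python framework.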

To give an example of a single line from one of the .log files:

[16:24:42]: Downloaded 0 Z_SYSTEM_FM traces from DEH, clients (282) from 00:00:00,000 to 00:00:00,000

From several sources and posts here, I found the following code to be the best-performing:

foreach ($f in $files) {
    $date = $f.BaseName.Substring(22,8)

    # The backtick and the trailing pipe continue the statement on the next line.
    ((Get-Content $f) -match "^.*\bDownloaded\b.*$") -replace "[[]", "" -replace "]:\s", " " `
        -replace "Downloaded " -replace "Traces from " -replace ",.*" -replace "$", " $date" |
        Add-Content CleanedLogs.txt
}

The variable $date contains the date that the corresponding .log file covers.

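I cannot change the input text data. I tried reading the 1.55 GB of data with -Raw, but after all the operations had run I could not manage to split the resulting single string. I also tried using more regular expressions, but that did not reduce the total run time. Maybe there is a way to use grep for this?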

Maybe someone knows a clever tweak to speed this up. At the moment the operation takes close to 20 minutes to compute. Thanks a lot!

Maybe this will speed things up for you:

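$outFile = Join-Path -Path $PSScriptRoot -ChildPath 'CleanedLogs.txt'
# The question describes .log input files, so filter on that extension.
$files   = Get-ChildItem -Path '<YOUR ROOTFOLDER>' -Filter '*.log' -File
foreach ($f in $files) {
    $date = $f.BaseName.Substring(22,8)
    # Read all lines via .NET, keep the "Downloaded" lines, strip the unwanted parts and append the date.
    [string[]]$lines = ([System.IO.File]::ReadAllLines($f.FullName) | Where-Object {$_ -match '^.*\bDownloaded\b.*$'} | ForEach-Object {
        ($_ -replace '\[|Downloaded|Traces from|,.*', '' -replace ']:\s', ' ' -replace '\s+', ' ') + " $date"
    })
    # Skip files without matching lines; there is nothing to append.
    if ($lines) { [System.IO.File]::AppendAllLines($outFile, $lines) }
}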

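I have run into similar problems in the past. Long story short: when working with large files, using .NET directly is much faster. You can learn more by reading up on it.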

The fastest way is probably to use IO.FileStream. For example:

$File = "C:\Path_To_File\Logs.txt"
$FileToSave = "C:\Path_To_File\result.txt"
$Stream = New-Object -TypeName IO.FileStream -ArgumentList ($File), ([System.IO.FileMode]::Open), ([System.IO.FileAccess]::Read), ([System.IO.FileShare]::ReadWrite)
$Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList ($Stream, [System.Text.Encoding]::ASCII, $true)
$Writer = New-Object -TypeName System.IO.StreamWriter -ArgumentList ($FileToSave)
while (!$Reader.EndOfStream)
{
    $Box = $Reader.ReadLine()
    if($Box -match "^.*\bDownloaded\b.*$")
    {
        $ReplaceLine = $Box -replace "1", "1234" -replace "[[]", ""
        $Writer.WriteLine($ReplaceLine)
    }
}
$Reader.Close()
$Writer.Close()
$Stream.Close()

You should be able to adapt the code above to your needs quite easily. To get the list of files, you can use Get-ChildItem.

I also recommend reading the related Stack Overflow post.

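The keys to better performance are:

- Avoid the pipeline and cmdlets, especially for file I/O (Get-Content, Add-Content).
  - Use the methods of a .NET type such as [IO.File] instead.
- Avoid looping in PowerShell code.
  - Instead, chain array-aware operators such as -match and -replace, which you are already doing.
  - Consolidate regexes to reduce the number of -replace calls.
  - Use precompiled regexes.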
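
To illustrate the point about consolidating regexes, here is a small sketch that runs the question's sample line through a chain of separate -replace calls and through a consolidated version, so the two results can be compared. The variable names ($sample, $chained, $merged), the extra unrelated line, and the date value are made up for the illustration.

```powershell
# Sample line from the question; the date is a hypothetical placeholder.
$sample = '[16:24:42]: Downloaded 0 Z_SYSTEM_FM traces from DEH, clients (282) from 00:00:00,000 to 00:00:00,000'
$date   = '20190101'

# -match filters the array; -replace then transforms every remaining element.
$lines = @($sample, 'some unrelated line')

# Original approach: one -replace call per removed string.
$chained = ($lines -match '\bDownloaded\b') -replace '\[' -replace '\]:\s', ' ' -replace 'Downloaded ' -replace 'Traces from ' -replace ',.*' -replace '$', " $date"

# Consolidated: all pure removals share one alternation, and the date
# replaces the unwanted tail directly instead of being appended afterwards.
$merged = ($lines -match '\bDownloaded\b') -replace '\[|Downloaded |Traces from ' -replace '\]:\s', ' ' -replace ',.*', " $date"

$chained
$merged
```

Both variants use the default case-insensitive matching of -replace, which is why 'Traces from ' also removes the lowercase "traces from " in the sample line.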

Putting it all together:

# Create precompiled regexes.
# Note: As written, they make the matching that -replace performs
#       case-*sensitive* (and culture-sensitive), 
#       which speeds things up slightly.
#       If you need case-*insensitive* matching, use option argument
#       'Compiled, IgnoreCase' instead.
$reMatch    = New-Object regex '\bDownloaded\b', 'Compiled'
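# 'traces from ' is lowercase to match the sample log line, because these regexes are case-sensitive.
$reReplace1 = New-Object regex 'Downloaded |traces from |\[', 'Compiled'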
$reReplace2 = New-Object regex '\]:\s', 'Compiled'
$reReplace3 = New-Object regex ',.*', 'Compiled'

# The platform-appropriate newline sequence.
$nl = [Environment]::NewLine

foreach ($f in $files) {

  $date = $f.BaseName.Substring(22,8)

  # Read all lines into an array, filter and replace, then join the
  # resulting lines with newlines and append the resulting single string
  # to the log file.
  [IO.File]::AppendAllText($PWD.ProviderPath + '/CleanedLogs.txt',
    ([IO.File]::ReadAllLines($f.FullName) -match
      $reMatch -replace 
        $reReplace1 -replace 
          $reReplace2, ' ' -replace 
            $reReplace3, " $date" -join 
              $nl) + $nl
  )

}

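Note that each file must fit into memory in full as an array of lines, plus a portion of it (both as an array and as a single multi-line string) whose size depends on how many lines pass the filter.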
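
If holding a whole file in memory is a concern, one possible alternative is to stream the lines instead. The sketch below is not part of the answer above. It assumes .NET 4 or later (for [IO.File]::ReadLines, which enumerates lines lazily) and trades the array-aware operators for an explicit foreach loop, so it may well be slower. Convert-LogFile and its parameter names are hypothetical, and the patterns follow the sample line from the question.

```powershell
# Hypothetical helper: streams one log file and appends the cleaned lines to $OutPath.
function Convert-LogFile {
    param(
        [string] $InPath,   # full path of the .log file to read
        [string] $OutPath,  # full path of the output file (appended to)
        [string] $Date      # date string to append to every kept line
    )

    # Precompiled, case-sensitive regexes.
    $reMatch   = New-Object regex '\bDownloaded\b', 'Compiled'
    $reRemove  = New-Object regex 'Downloaded |traces from |\[', 'Compiled'
    $reBracket = New-Object regex '\]:\s', 'Compiled'
    $reTail    = New-Object regex ',.*', 'Compiled'

    # Second constructor argument $true opens the file in append mode.
    $writer = New-Object IO.StreamWriter $OutPath, $true
    try {
        # ReadLines yields one line at a time instead of loading the whole file.
        foreach ($line in [IO.File]::ReadLines($InPath)) {
            if ($reMatch.IsMatch($line)) {
                $line = $reRemove.Replace($line, '')
                $line = $reBracket.Replace($line, ' ')
                $writer.WriteLine($reTail.Replace($line, " $Date"))
            }
        }
    }
    finally {
        # Flushes and closes the output file even if reading fails.
        $writer.Dispose()
    }
}
```

Inside the foreach ($f in $files) loop it could be called with -InPath $f.FullName, -OutPath set to the full path of CleanedLogs.txt, and -Date $f.BaseName.Substring(22,8).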

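Is there a single regex that covers all the match and replace operations? I tried for two hours but could not work out how to do it. I will try your suggestions for reading and writing!

Re regex: probably not. You need at least one -match to select only the lines of interest, and only then can the replacements start (-replace does not filter; it passes non-matching lines through). You can at least merge all the -replace operations that remove strings into a single operation.

I get an error on $Box = $Reader.ReadLine():

You cannot call a method on a null-valued expression.
At C:\Users\jmoecke\PycharmProjects\TraceDashboardV3\Stackoverflow2.ps1:9 char:9
+ $Box = $Reader.ReadLine()
+ ~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvokeMethodOnNull

Do you know what could be going wrong?

The code works fine for me. Try creating $Reader without the encoding arguments, for example: $Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $Stream. What did you change in the code?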
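
The difference between -match and -replace discussed above can be seen in a small sketch with made-up input lines: applied to an array, -match returns only the matching elements, whereas -replace returns every element and leaves non-matching ones unchanged.

```powershell
# Made-up input: one line of interest and one unrelated line.
$lines = 'Downloaded 3 traces', 'Connection closed'

# -match on an array acts as a filter: only matching elements are returned.
$filtered = @($lines -match 'Downloaded')

# -replace on an array transforms each element; elements without a match
# are passed through unchanged, so the array keeps its length.
$replaced = @($lines -replace 'Downloaded ', '')

"filtered: $($filtered.Count) line(s)"
"replaced: $($replaced.Count) line(s)"
```

This is why the filtering step has to come first: running only the -replace chain would also write every unrelated log line to the output file.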