PowerShell: How to (efficiently) match the contents (lines) of many small files against the contents (lines) of a single large file and update/recreate them


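I have been trying to solve the following case:

Many small text files (in subfolders) need their contents (lines) matched against lines that exist in another (large) text file. The small files then need to be updated or copied using those matching lines.

I was able to write some working code for this, but I need to improve it or use a completely different approach, because it is extremely slow and would take more than 40 hours to get through all the files.

I already had the idea of using SQL Server: bulk-import all files into a single table with [relative path], [filename], [jap content], and the translation file into a table with [jap content], [eng content], then join on [jap content] and bulk-export the joined table as separate files using [relative path], [filename]. Unfortunately, I got stuck right at the start because of formatting and encoding problems, so I dropped that idea and started writing a PowerShell script.

Now in detail:

More than 40k txt files spread across multiple subfolders, each with multiple lines; any given line can occur in multiple files.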

 Content:

 UTF8 encoded Japanese text that also can contain special characters like \\[*+(), each Line ending with a tabulator character. Sounds like csv files but they don't have headers.

One large file with more than 600k lines that contains the translations for the small files. Every line in this file is unique.

 Content:

 Again UTF8 encoded Japanese text. Each line formatted like this (without brackets):

 [Japanese Text][tabulator][English Text]

 Example:

 テスト[1]  Test [1]

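The end result should be a copy or updated version of all those small files, with their lines replaced by the matching lines from the translation file, while keeping their relative paths.

What I have so far: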

$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'

$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)

Get-Childitem -path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    $_.Name
    $filepath = ($_.Directory.FullName).substring(2) 
    $filearray = [System.Collections.ArrayList]@()
    $filearray = @(Get-Content -path $_.FullName -Encoding UTF8)
    $filearray = $filearray | ForEach-Object {
        $result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]','\$&')
        if ($result) {
            $_ = $result
        }
        $_
    }
    If(!(test-path B:\output\$filepath)) {New-Item -ItemType Directory -Force -Path B:\output\$filepath}
    #$("B:\output\"+$filepath+"\")
    $filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10

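I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything complex might fly over my head.

As others have pointed out, a hashtable is the best choice for mapping the Japanese-only phrases to the dual-language lines.

Additionally, using .NET APIs for file I/O can speed up the operation noticeably.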

# Be sure to specify all paths as full paths, not least because .NET's 
# current directory usually differs from PowerShell's
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName

# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
  $ht[$line.Split("`t")[0] + "`t"] = $line
}

Get-ChildItem $inPath -Recurse -File -Filter *.txt | Foreach-Object -Parallel {
  # Translate each line to its matching dual-language line
  # (which includes the translation) via the hashtable.
  # NOTE: If an input line isn't represented as a key in the hashtable,
  #       it is passed through as-is.
  $lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
    ($using:ht)[$line] ?? $line
  }
  # Synthesize the output file path, ensuring that the target dir. exists.
  $outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
  # Write to the output file.
  # Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
  Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
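
The comment at the top of the script about full paths can be illustrated with a minimal sketch. The relative path below is hypothetical and used only for illustration; the point is that .NET methods resolve relative paths against the process-wide current directory, which PowerShell does not keep in sync with its own current location.

```powershell
# Hypothetical relative path, used only for illustration.
$relativePath = '.\Translation.txt'

# PowerShell's current location and .NET's process-wide current directory
# are tracked separately and can differ.
"PowerShell location : $($PWD.ProviderPath)"
".NET directory      : $([Environment]::CurrentDirectory)"

# Resolve the relative path against PowerShell's location.
# The target file does not need to exist for this call to work.
$fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($relativePath)
$fullPath
```

Passing the resolved full path to calls such as [IO.File]::ReadLines() avoids surprises when the two directories differ.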

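Note: Your use of ForEach-Object -Parallel implies that you are using PowerShell [Core] 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where the default encodings vary widely).

Therefore, as an alternative to using the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.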
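
As a minimal sketch of the switch -File alternative, the following builds the same kind of hashtable as the script above. The temporary demo file and its two sample lines are assumptions made up for illustration; in the real scenario the file would be the translation file.

```powershell
# Create a tiny tab-separated demo file in the temp directory (illustration only).
$demoFile = Join-Path ([IO.Path]::GetTempPath()) 'switch-file-demo.txt'
[IO.File]::WriteAllLines($demoFile, [string[]] @("テスト[1]`tTest [1]", "猫`tCat"))

# Build the hashtable with switch -File instead of [IO.File]::ReadLines().
$ht = @{}
switch -File $demoFile {
  default {
    # $_ holds the current line; the key is the Japanese part plus the trailing tab,
    # matching the format of the lines in the small files.
    $ht[$_.Split("`t")[0] + "`t"] = $_
  }
}

# Look up a Japanese phrase (with trailing tab) to get the full dual-language line.
$lookup = $ht["猫`t"]
$lookup

Remove-Item $demoFile
```

The default branch runs once per line, so no explicit loop or reader object is needed.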

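You currently have to perform m*n regex matches, which will be very slow. You should parse the translation file only once to build a hashtable (Japanese text as the key, English text as the value). A hashtable allows much faster lookups than an array (similar to an indexed SQL table). Using String.Split() instead of a regex improves performance further.

Thank you for your comment. I had read about this advantage of hashtables, but I struggled with the implementation.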
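
The hashtable approach described above can be sketched as follows. This is only an illustration under stated assumptions: the sample lines and the variable names ($translationLines, $smallFileLines, $map) are made up, and in practice the lines would be read from the actual files.

```powershell
# Hypothetical sample data standing in for the translation file's lines.
$translationLines = @("テスト[1]`tTest [1]", "猫`tCat")

# One pass over the translation lines: Japanese text is the key, English text the value.
$map = @{}
foreach ($line in $translationLines) {
  $parts = $line.Split("`t")   # plain string split, no regex involved
  $map[$parts[0]] = $parts[1]
}

# Hypothetical lines from a small file; each ends with a tab, as described in the question.
$smallFileLines = @("テスト[1]`t", "未知`t")

$translated = foreach ($line in $smallFileLines) {
  $key = $line.TrimEnd("`t")
  if ($map.ContainsKey($key)) {
    # Rebuild the dual-language line: Japanese, tab, English.
    "$key`t$($map[$key])"
  }
  else {
    # No translation available: pass the line through unchanged.
    $line
  }
}
$translated
```

Each small-file line costs a single hashtable lookup, so the total work grows with m + n rather than m * n, and special characters such as [ or * need no escaping because no regex is involved.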