PowerShell, two files: keep only the lines whose first n characters are the same


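There are two text files in the current working directory, a.txt and b.txt. From a.txt, I want to remove every line whose first 5 characters do not occur as the first 5 characters of any line in b.txt. (Put differently: keep only those lines of a.txt whose first 5 characters also appear as the first 5 characters of some line in b.txt.) Whatever follows the 5th character of a line is irrelevant.

Example:

a.txt

abcde000dsdsddsdsdsdsdsd 0123456xxx kkk xyzxyzxyzfeeeee kkkkkkkkkkk

Expected result (the lines of a.txt whose characters 1-5 appear in b.txt):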

Try this:

    $listB = Get-Content "c:\temp\b.txt" | Where-Object { $_.Length -gt 4 } | Select-Object @{N="First5"; E={ $_.Substring(0, 5) }}
    Get-Content "c:\temp\a.txt" | Where-Object { $_.Length -gt 4 -and $_.Substring(0, 5) -in $listB.First5 }


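Your code doesn't work because your pattern doesn't match anything. The regular expression ^[5] means "the character 5 at the beginning of a string" (the square brackets define a character class), not "5 characters at the beginning of a string". The latter would be ^.{5}. In addition, you never match the contents of a.txt against the contents of b.txt.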
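
To make the difference between the two patterns concrete, here is a small sketch. The sample strings are made up for illustration and do not come from the question's files.

```powershell
# Sample strings invented for this illustration.
$samples = '5abcdef', 'abcde000', 'abc'

# '^[5]' is a character class: it only matches strings that START WITH the digit 5.
$startsWithFive = @($samples -match '^[5]')     # 5abcdef

# '^.{5}' matches any string that has at least 5 characters.
$hasFiveChars = @($samples -match '^.{5}')      # 5abcdef, abcde000

# With a capture group, -replace reduces each string to its 5-character prefix.
$prefixes = @($hasFiveChars -replace '^(.{5}).*', '$1')   # 5abcd, abcde
```

Applied to an array, -match acts as a filter, which is why the 3-character string drops out of the second result.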

There are several ways to do what you want:

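Extract the first 5 characters of each line of b.txt into an array and compare the lines of a.txt against that array. Another answer here takes more or less this approach, but it requires PowerShell v3 or later. A variant that works in all PowerShell versions could look like this: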


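    $pattern = '^(.{5}).*'
    # Prefixes of all lines in b.txt that have at least 5 characters, deduplicated.
    # (Sort-Object -Unique rather than Get-Unique, which only collapses adjacent duplicates.)
    $ref = @(Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' | Sort-Object -Unique
    # Keep the lines of a.txt whose 5-character prefix is in $ref.
    Get-Content 'a.txt' | Where-Object { $ref -contains ($_ -replace $pattern, '$1') } | Set-Content 'results.txt'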

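Since lookups in arrays are comparatively slow and don't scale well (they get noticeably slower as the number of elements in the array grows), you can also put the reference values into an index, such as a hashtable, so that each check is an index lookup, which is considerably faster.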
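
A minimal sketch of such an indexed lookup might look like the following. The function name Select-LineByPrefix and its parameter names are hypothetical, chosen only for this illustration, and Substring is used here instead of a regular expression.

```powershell
# Sketch only: function and parameter names are hypothetical.
function Select-LineByPrefix {
    param(
        [string]$Path,            # file to filter (a.txt in the question)
        [string]$ReferencePath,   # file that supplies the allowed prefixes (b.txt)
        [int]$Length = 5          # number of leading characters to compare
    )

    # Build the index: one hashtable key per distinct prefix in the reference file.
    # Hashtable literals compare string keys case-insensitively, like -contains does.
    $index = @{}
    Get-Content $ReferencePath | ForEach-Object {
        if ($_.Length -ge $Length) { $index[$_.Substring(0, $Length)] = $true }
    }

    # Keep only the lines whose prefix is a key in the index; shorter lines are skipped.
    Get-Content $Path | Where-Object {
        $_.Length -ge $Length -and $index.ContainsKey($_.Substring(0, $Length))
    }
}

# Usage for the files from the question:
# Select-LineByPrefix -Path 'a.txt' -ReferencePath 'b.txt' | Set-Content 'results.txt'
```

Because the lines of a.txt are streamed through the filter, the output keeps their original order.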

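Another option is to build a second regular expression from the substrings extracted from b.txt and to compare the contents of a.txt against that expression: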

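    $pattern = '^(.{5}).*'
    # Escape each distinct prefix so that regex metacharacters in the data are matched literally.
    $list = @(Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' | Sort-Object -Unique | ForEach-Object { [regex]::Escape($_) }
    # Combine the prefixes into one anchored alternation of the form ^(prefix1|prefix2|...).
    $ref = '^({0})' -f ($list -join '|')
    @(Get-Content 'a.txt') -match $ref | Set-Content 'results.txt'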
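
To show what the generated expression looks like and why the prefixes are escaped, here is a small sketch with in-memory sample values. The values are invented for illustration; no files are read.

```powershell
# Prefixes as they might be extracted from b.txt (sample values, one with regex metacharacters).
$list = 'abcde', 'a.c*e', 'xyzxy' | ForEach-Object { [regex]::Escape($_) }

# Join the escaped prefixes into one anchored alternation.
$ref = '^({0})' -f ($list -join '|')
# $ref is now: ^(abcde|a\.c\*e|xyzxy)

# Applied to an array, -match returns only the elements that match.
$kept = @('abcde000', 'a.c*e123', 'aXc*e123', 'kkkkkkk') -match $ref
# $kept is now: abcde000, a.c*e123
```

Without [regex]::Escape, the prefix a.c*e would be interpreted as a pattern rather than as literal text, so it would no longer match the line that actually starts with those five characters.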

Note that each of these approaches ignores lines shorter than 5 characters.

If performance is a concern, consider using hashtables as an index:

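    $Pattern = '^(.{5}).*'
    $a = @{}; $b = @{}
    # Group the lines of each file by their 5-character prefix.
    Get-Content -Path a.txt | Where-Object { $_ -match $Pattern } | ForEach-Object { $a[$Matches[1]] = @($a[$Matches[1]] + $_) }
    Get-Content -Path b.txt | Where-Object { $_ -match $Pattern } | ForEach-Object { $b[$Matches[1]] = @($b[$Matches[1]] + $_) }
    # Emit the a.txt groups whose prefix is also a key in $b (ContainsKey is a hash lookup, not a scan of the keys).
    # The output is grouped by prefix rather than kept in the original line order.
    $a.Keys | Where-Object { $b.ContainsKey($_) } | ForEach-Object { $a.$_ } | Set-Content results.txt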

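Thanks for the details. If a.txt and b.txt both have more than 1 million lines (each line is CSV with 4 values [columns], maximum line length 400 characters), which solution would you recommend?

Probably the second one. Run some benchmarks first to be sure, though. Some adjustments may be necessary for performance reasons.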
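
One way to run such a benchmark is Measure-Command. The sketch below compares an array lookup with a hashtable lookup on synthetic in-memory data. The data and its size (2000 entries each) are assumptions made for this illustration, so real files with a million lines will behave differently in absolute terms.

```powershell
# Synthetic data, invented for this sketch: 2000 reference prefixes and 2000 CSV-like lines.
$prefixes = 1..2000 | ForEach-Object { '{0:D5}' -f $_ }
$lines    = 1..2000 | ForEach-Object { '{0:D5},some,csv,payload' -f ($_ * 2) }

# Index for the hashtable variant.
$index = @{}
foreach ($p in $prefixes) { $index[$p] = $true }

# Variant 1: linear search through the array for every line.
$arrayTime = Measure-Command {
    $lines | Where-Object { $prefixes -contains $_.Substring(0, 5) } | Out-Null
}

# Variant 2: hash lookup for every line.
$hashTime = Measure-Command {
    $lines | Where-Object { $index.ContainsKey($_.Substring(0, 5)) } | Out-Null
}

'Array lookup:     {0:N0} ms' -f $arrayTime.TotalMilliseconds
'Hashtable lookup: {0:N0} ms' -f $hashTime.TotalMilliseconds
```

For a meaningful comparison, repeat the measurement with a representative sample of the real files and include the time spent reading them.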