How can I use PowerShell to extract specific values from an HTML table and copy them horizontally to a file?

html regex powershell

Code

select-string -Path "input.txt" -Pattern '<td>[A-Z][a-z]+' -AllMatches | % { $_.Matches } | % { $_.Value } > 'outcome.txt'

Input

<table>
  <tr>
    <th>City</th>
    <th>Population</th>
  </tr>
  <tr>
    <td>Amsterdam</td>
    <td>900K</td>
  </tr>
  <tr>
    <td>Rotterdam</td>
    <td>700K</td>
  </tr>
  <tr>
    <td>The Hague</td>
    <td>500K</td>
  </tr>
  <tr>
    <td>Utrecht</td>
    <td>300K</td>
  </tr>  
</table>

Expected outcome

Amsterdam 900K
Rotterdam 700K
The Hague 500K
Utrecht 300K

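Issues

Displaying horizontally

First, the results in outcome.txt and outcome2.txt could be merged manually, but this is only an example; the actual file contains thousands of rows and more than 100 columns.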

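Specific extraction

Second, the actual regex is far more extensive, lines can contain 500 characters, and a specific extraction is needed. For example, in the case of

<td>Utrecht</td>

the expected outcome is

Utrecht

rather than

<td>Utrecht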

Update

foreach ($line in [System.IO.File]::ReadLines("input.txt")) {
#  if ($line -match '<td>(.*)</td>\n<td>(\d+)</td>') {
  if ($line -match '<td>(.*)(</td>)') {  
     $matches[1] + $matches[2]
  }  
}

Result:

Amsterdam</td>
900K</td>
Rotterdam</td>
700K</td>
The Hague</td>
500K</td>
Utrecht</td>
300K</td>

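The current issue is that the commented-out pattern containing \n does not match the second line, although a test shows that the second element can be extracted with a second pair of parentheses.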
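One likely reason the \n pattern never matches: [System.IO.File]::ReadLines hands the loop one line at a time, so no single $line ever contains a line break. A minimal sketch of one way around this, assuming the whole file fits in memory and each row has exactly two cells, is to read the text as a single string and let \s* span the break between the cells. The $html sample and the $pattern, $rows names are illustrative and not part of the original post.

```powershell
# Sample text standing in for input.txt (same shape as the table in the question).
$html = @'
<table>
  <tr>
    <th>City</th>
    <th>Population</th>
  </tr>
  <tr>
    <td>Amsterdam</td>
    <td>900K</td>
  </tr>
  <tr>
    <td>Rotterdam</td>
    <td>700K</td>
  </tr>
  <tr>
    <td>The Hague</td>
    <td>500K</td>
  </tr>
  <tr>
    <td>Utrecht</td>
    <td>300K</td>
  </tr>
</table>
'@

# With a real file, read it as ONE string instead of line by line:
# $html = Get-Content -Path 'input.txt' -Raw

# One capture group per cell; \s* covers the line break and indentation between cells.
$pattern = '<td>(.*?)</td>\s*<td>(.*?)</td>'

# Join the two captured values of every match on a single line.
$rows = foreach ($m in [regex]::Matches($html, $pattern)) {
    '{0} {1}' -f $m.Groups[1].Value, $m.Groups[2].Value
}

$rows
# Amsterdam 900K
# Rotterdam 700K
# The Hague 500K
# Utrecht 300K
```

Because only the capture groups are emitted, the <td> and </td> tags no longer appear in the output. Piping $rows to Set-Content 'outcome.txt' would write the file. A fixed two-cell pattern does not scale to 100+ columns, which is where parsing the table as a structure becomes more attractive.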

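To take a different approach: someone has already created a cmdlet that does the heavy lifting for you by converting the table to objects. Credit for this goes to Joel Bennett.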

function ConvertFrom-Html {
   #.Synopsis
   #   Convert a table from an HTML document to a PSObject
   #.Example
   #   Get-ChildItem | Where { !$_.PSIsContainer } | ConvertTo-Html | ConvertFrom-Html -TypeName Deserialized.System.IO.FileInfo
   #   Demonstrates round-tripping files through HTML
   param(
      # The HTML content
      [Parameter(ValueFromPipeline=$true)]
      [string]$html,

      # A TypeName to inject to PSTypeNames 
      [string]$TypeName
   )
   begin { $content = "$html" }
   process { $content += "$html" }
   end {
      [xml]$table = $content -replace '(?s).*<table[^>]*>(.*)</table>.*','<table>$1</table>'

      $header = $table.table.tr[0]  
      $data = $table.table.tr[1..1e3]

      foreach($row in $data){ 
         $item = @{}

         $h = "th"
         if(!$header.th) {
            $h = "td"
         }
         for($i=0; $i -lt $header.($h).Count; $i++){
            if($header.($h)[$i] -is [string]) {
               $item.($header.($h)[$i]) = $row.td[$i]
            } else {
               $item.($header.($h)[$i].InnerText) = $row.td[$i]
            }
         }
         Write-Verbose ($item | Out-String)
         $object = New-Object PSCustomObject -Property $item 
         if($TypeName) {
            $Object.PSTypeNames.Insert(0,$TypeName)
         }
         Write-Output $Object
      }
   }
}

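This should be easier to work with, depending on where you are going with it, for example Export-Csv or something similar. With the data as objects, you can go almost anywhere.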
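The same object-based idea can be sketched without the function above, assuming the table is well-formed XML so that a plain [xml] cast can parse it. The sample text and the names $doc, $headers, $objects and $lines are illustrative; the loop handles any number of columns because it walks every td in a row instead of naming them in a regex.

```powershell
# Sample text standing in for input.txt; assumed to be well-formed XML.
$html = @'
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Amsterdam</td><td>900K</td></tr>
  <tr><td>The Hague</td><td>500K</td></tr>
</table>
'@
# With a real file: $html = Get-Content -Path 'input.txt' -Raw

[xml]$doc = $html

# Column names come from the header row.
$headers = @($doc.SelectNodes('//tr[1]/th') | ForEach-Object { $_.InnerText.Trim() })

# One object per data row (rows that contain at least one td).
$objects = foreach ($tr in $doc.SelectNodes('//tr[td]')) {
    $cells = @($tr.SelectNodes('td') | ForEach-Object { $_.InnerText.Trim() })
    $props = [ordered]@{}
    for ($i = 0; $i -lt $headers.Count; $i++) {
        $props[$headers[$i]] = $cells[$i]
    }
    New-Object PSObject -Property $props
}

# Horizontal text output: all values of a row joined on one line.
$lines = foreach ($o in $objects) {
    ($o.PSObject.Properties | ForEach-Object { $_.Value }) -join ' '
}

$lines
# Amsterdam 900K
# The Hague 500K

# Or keep the structure instead of flat text:
# $objects | Export-Csv -Path 'outcome.csv' -NoTypeInformation
```

Real-world HTML is often not valid XML (unclosed tags, entities such as &nbsp;), in which case the [xml] cast throws and a more tolerant parser would be needed.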