PowerShell: extracting an HTTP link from an HTML element
Symantec recently changed their download page and moved it to Broadcom. Since then, Invoke-WebRequest is unable to get the HTTP URL of the v5i64.exe file.
However, the HTTP URL can be found when inspecting the elements inside the page body with the browser's developer tools.
Does anyone know how to use PowerShell to extract the URL, which changes daily?
$webreq = Invoke-WebRequest "https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep"
$webreq.Links | Select href
Use Internet Explorer through a COM object:
$ie = new-object -ComObject "InternetExplorer.Application"
$ie.visible=$True
while($ie.Busy) { Start-Sleep -Milliseconds 100 }
$IE.navigate2("https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep")
while ($IE.busy) {
start-sleep -milliseconds 1000 #wait 1 second interval to load page
}
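Then look up the elements with
$ie.Document.IHTMLDocument3_getElementsByTagName("element name")
The following PowerShell script prompts you to download the link that contains the text v5i64.exe and uses HTTPS. It works in Windows PowerShell 5.1. It does not work in PowerShell 6 or 7 (PowerShell Core).
Tested on Windows 10.0.18363.657, Internet Explorer 11.657.18362, PowerShell 5.1.18362.628.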
$url = "https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep"
$outfile = "./v5i64.exe"
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.visible=$True
while($ie.Busy) {
    Start-Sleep -Milliseconds 100
}
$ie.navigate2($url)
while($ie.ReadyState -ne 4 -or $ie.Busy) {
    Start-Sleep -Milliseconds 500
}
$ie.Document.getElementsByTagName("a") | % {
    if ($_.ie8_href -like "*v5i64.exe") {
        if ($_.ie8_href -like "https://*") {
            $len = (Invoke-WebRequest $_.ie8_href -Method Head).Headers.'Content-Length'
            Write-Host "File:" $_.ie8_href
            Write-Host "Size:" $len
            $confirm = Read-Host "Download file? [y/n]"
            if ($confirm -eq "y") {
                Write-Host "Downloading" $_.ie8_href
                Invoke-WebRequest -Uri $_.ie8_href -OutFile $outfile
            }
        }
    }
}
$ie.Stop()
$ie.Quit()
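The two nested -like checks in the loop above can be pulled out into a small function that is easy to try without Internet Explorer. This is only a sketch: the function name Select-DownloadLink, its parameters, and the example.com sample values are made up for illustration and are not part of the original answer.

```powershell
# Hypothetical helper: keep only hrefs that use HTTPS and end with the given suffix.
function Select-DownloadLink {
    param(
        [string[]]$Href,
        [string]$Suffix = 'v5i64.exe'
    )
    # -like is case-insensitive by default, matching the checks in the script above.
    $Href | Where-Object { $_ -like 'https://*' -and $_ -like "*$Suffix" }
}

# Made-up sample values, for illustration only.
$sample = @(
    'https://example.com/defs/20200301-002-v5i64.exe',
    'http://example.com/defs/20200301-002-v5i64.exe',
    'https://example.com/defs/readme.txt'
)
$picked = @(Select-DownloadLink -Href $sample)
$picked
```

In the script above, the input would be the ie8_href values of the anchor elements rather than a hard-coded array.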
Thank you for the suggested solutions. However, here is the final code I used:
# Note: $SEP_last is not defined in this snippet; it has to be populated beforehand.
$SEP_last_link = ("http://definitions.symantec.com/defs/"+($SEP_last | Select-String release -NotMatch | select -Last 1))
$Symantec_folder = "C:\Download for DVD\Symantec"
$Symantec_filepath = "$Symantec_folder\$SEP_last"
if (!(Test-Path "$Symantec_filepath" -PathType Leaf)) {
    Write-Host "`rStart to download Symantec $SEP_last file: $(Get-Date)`r"
    $start_time = (Get-Date)
    $webclient = New-Object System.Net.WebClient
    $WebClient.DownloadFile($SEP_last_link, $Symantec_filepath)
    Write-Host "`r$SEP_last file has been downloaded successfully`r" -ForegroundColor Green
    $end_time = $(get-date) - $start_time
    $total_time = "{0:HH:mm:ss}" -f ([datetime]$end_time.Ticks)
    Write-Host "`rTime to download Symantec $SEP_last file: $total_time`r"
} else {
    Write-Host "`rSymantec $SEP_last file already exists!`r" -ForegroundColor Yellow
}
# Remove older definition files, keeping the one that was just checked or downloaded.
Get-ChildItem -Path "$Symantec_Folder\*-v5i64.exe" -Exclude "$SEP_last" -Verbose -Force | Remove-Item
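The script above formats the elapsed download time by casting the TimeSpan ticks to a DateTime. As a possible alternative, the TimeSpan can be formatted directly. The sketch below uses a Stopwatch and a short sleep as a stand-in for the download; the variable names are illustrative and not from the original post.

```powershell
# Measure a block of work with a Stopwatch instead of subtracting two Get-Date values.
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
Start-Sleep -Milliseconds 50          # stand-in for the actual download
$stopwatch.Stop()

# Format the TimeSpan directly; the colons must be escaped in a custom TimeSpan format.
$elapsedText = $stopwatch.Elapsed.ToString('hh\:mm\:ss')
Write-Host "Time to download: $elapsedText"

# Fixed value to show the format: 3725 seconds is 1 hour, 2 minutes, 5 seconds.
$fixedText = (New-TimeSpan -Seconds 3725).ToString('hh\:mm\:ss')
Write-Host $fixedText
```

Note that hh only shows the hours component, so a duration of 24 hours or more would need the days component added to the format as well.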
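The reason is that the link is not part of the page you are downloading. Symantec builds the page from downloads that are processed after the initial page load.
Thanks, John, for the prompt feedback. In that case, is it possible to download/dump the page into a temp variable, simulating a manual load of the page, and then extract the link?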
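As a rough sketch of the "dump into a variable, then extract" idea: once the HTML is held in a string, href values can be pulled out with a regular expression. This only helps if the link is actually present in that text, which, per the comment above, is not the case for the initial response of this page. The function name Get-HrefFromHtml, its parameters, and the sample HTML are hypothetical.

```powershell
# Hypothetical helper: extract href attribute values from an HTML string,
# optionally filtered with a wildcard pattern.
function Get-HrefFromHtml {
    param(
        [string]$Html,
        [string]$Like = '*'
    )
    # Matches href="..." or href='...' and captures the value in group 1.
    $pattern = 'href\s*=\s*["'']([^"'']+)["'']'
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    [regex]::Matches($Html, $pattern, $options) |
        ForEach-Object { $_.Groups[1].Value } |
        Where-Object { $_ -like $Like }
}

# Made-up HTML, for illustration only.
$html = '<a href="https://example.com/defs/20200301-002-v5i64.exe">definitions</a><a href="/support">support</a>'
$links = @(Get-HrefFromHtml -Html $html -Like '*v5i64.exe')
$links
```

With a live page, the string could come from the Content property of an Invoke-WebRequest response, but a regex over raw HTML is fragile and will not see anything that scripts add to the page later.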