HTML scraping in batch or PowerShell


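I need to scrape the HTML of a website that is launched from a .url file, find a certain line, and then grab every line below it up to a certain point. A sample of the HTML is shown below: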

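  • (none)
    • Authorized Administrators and Users Authorized Administrators: jim (you) password: (blank/none) bob password: Littl3@birD batman password: 3ndur4N(e&home dab password: captain Authorized Users: bag crab oliver james scott john apple Competition Guidelines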

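I need to put all the authorized administrators into one txt file, the authorized users into another txt file, and then both into a third txt file. Can this be done with batch and PowerShell?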

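This is really ugly, and very fragile. A good HTML parser would be a better approach.

However, assuming you don't have the resources for that, here is one way to get the data. If you really want to generate two more files [Admin & User], you can do that from this object: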

Name      UserType     
----      --------     
jim (you) Administrator
bob       Administrator
batman    Administrator
dab       Administrator
bag       User         
crab      User         
oliver    User         
james     User         
scott     User         
john      User         
apple     User
CSV file content

Name   Type  Password      
----   ----  --------      
jim    Admin (blank/none)  
bob    Admin Littl3@birD   
batman Admin 3ndur4N(e&home
dab    Admin captain 

Here is my attempt at getting what you want.

Name   Type
----   ----
bag    User
crab   User
oliver User
james  User
scott  User
john   User
apple  User
users.csv

Name   Type  Password      
----   ----  --------      
jim    Admin (blank/none)  
bob    Admin Littl3@birD   
batman Admin 3ndur4N(e&home
dab    Admin captain       
bag    User                
crab   User                
oliver User                
james  User                
scott  User                
john   User                
apple  User 
adminsandusers.csv
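
The three files listed above could be produced from a single collection of objects. The following is only a sketch under stated assumptions: the `$all` sample data, the `Name`/`Type`/`Password` property names (taken from the table headers), the `admins.csv` file name, and the temp output folder are illustrative, not the answerer's actual code.

```powershell
# Hypothetical sample data mirroring a few rows of the tables above.
$all = @(
  [pscustomobject] @{ Name = 'jim';  Type = 'Admin'; Password = '(blank/none)' }
  [pscustomobject] @{ Name = 'bob';  Type = 'Admin'; Password = 'Littl3@birD' }
  [pscustomobject] @{ Name = 'bag';  Type = 'User';  Password = '' }
  [pscustomobject] @{ Name = 'crab'; Type = 'User';  Password = '' }
)

# Split the collection by type; @(...) keeps each result an array
# even when only one object matches.
$admins = @($all | Where-Object { $_.Type -eq 'Admin' })
$users  = @($all | Where-Object { $_.Type -eq 'User' } | Select-Object Name, Type)

# Assumed output location: a scratch folder under the temp directory.
$outDir = Join-Path ([IO.Path]::GetTempPath()) 'scrape-demo'
$null = New-Item -ItemType Directory -Path $outDir -Force

# One file per group, plus one with everything.
$admins | Export-Csv -Path (Join-Path $outDir 'admins.csv') -NoTypeInformation
$users  | Export-Csv -Path (Join-Path $outDir 'users.csv') -NoTypeInformation
$all    | Export-Csv -Path (Join-Path $outDir 'adminsandusers.csv') -NoTypeInformation
```

Filtering one collection, rather than parsing the page three times, keeps the three files consistent with each other.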

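I believe this answer demonstrates useful techniques, and I have verified that it works with the sample input under the stated constraints. If you disagree, please tell us (in words) so that the answer can be improved.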

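In general, as already noted, a dedicated HTML parser is the better choice, but given the easily identifiable enclosing tags in the input (assuming they don't vary), you can use a regex-based solution.

Here is a regex-based PSv4+ solution, but note that it relies on the input containing whitespace (newlines, leading spaces) exactly as shown in your question: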

# $html is assumed to contain the input HTML text (can be a full document).
$admins, $users = (
  # Split the HTML text into the sections of interest.
  $html -split
    '\A.*<b>Authorized Administrators&#58;</b>|<b>Authorized Users&#58;</b>' `
    -ne '' `
    -replace '<.*'
).ForEach({
  # Extract admin lines and user lines each, as an array.
  , ($_ -split '\r?\n' -ne '')
})

# Clean up the $admins array and transform the username-password pairs
# into custom objects with .username and .password properties.
$admins = $admins -split '\s+password&#58;\s+' -ne ''
$i = 0;
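# Assign the result back to $admins; otherwise the objects are only written
# to the output stream and $admins remains an array of strings.
$admins = $admins.ForEach({ 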
  if ($i++ % 2 -eq 0) { $co = [pscustomobject] @{ username = $_; password = '' } } 
  else { $co.password = $_; $co } 
})

# Create custom objects with the same structure for the users.
$users = $users.ForEach({
  [pscustomobject] @{ username = $_; password = '' }
})

# Output to CSV files.
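# -NoTypeInformation suppresses the "#TYPE ..." header line that
# Windows PowerShell otherwise writes as the first line of each file.
$admins | Export-Csv admins.csv -NoTypeInformation
$users | Export-Csv users.csv -NoTypeInformation
$admins + $users | Export-Csv all.csv -NoTypeInformation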
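
The code above assumes that `$html` already holds the page text. Since the question starts from a .url file, here is a minimal sketch of one way to get there. The helper name `Get-UrlFromShortcut`, the demo file name, and the `example.com` address are assumptions made for illustration. The sketch relies on .url shortcuts being INI-style text files that contain a `URL=` line.

```powershell
# Hypothetical helper (not part of the answers above): read the target address
# from a Windows .url shortcut, an INI-style text file with a "URL=" line.
function Get-UrlFromShortcut {
  param([Parameter(Mandatory)] [string] $Path)

  # Take the first line that starts with "URL=" (matching is case-insensitive).
  $line = Get-Content -LiteralPath $Path |
    Where-Object { $_ -match '^\s*URL=' } |
    Select-Object -First 1
  if (-not $line) { throw "No URL= entry found in '$Path'." }

  # Strip the "URL=" prefix and surrounding whitespace.
  ($line -replace '^\s*URL=').Trim()
}

# Demo with a throwaway shortcut file; example.com is a placeholder address.
$shortcut = Join-Path ([IO.Path]::GetTempPath()) 'demo-page.url'
Set-Content -LiteralPath $shortcut -Value '[InternetShortcut]', 'URL=https://example.com/'
$url = Get-UrlFromShortcut -Path $shortcut

# Download the page as a single string (needs network access, so left commented out):
# $html = (Invoke-WebRequest -Uri $url -UseBasicParsing).Content
```

`Invoke-WebRequest` is available from PowerShell v3 onward, and its `.Content` property returns the response body as one string, which is the form the `-split` operations above expect.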