Extracting cell values from an HTML table with bash


html regex bash


I'm trying to write a Bash/Perl script that gets specific values out of a dynamic HTML table.

Here is a sample of my web page:

<table border="1" bordercolor="#FFCC00" style="background-color:#FFFFCC" width="100%" cellpadding="3" cellspacing="3">
<tr align="center"> <th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
</table>

However, what I have so far only works for "Deployed".

Any ideas?

You should use a proper parser for this. With xmllint, you can extract elements based on an XPath expression.

For example:

$ xmllint --html --format --shell file.html <<< "cat //table/tr/td[position()=8]/text()"
/ >  -------
Deployed
 -------
Deployed
/ >


In the command above, the XPath expression //table/tr/td[position()=8]/text() returns the values from the 8th table column.
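To pull one specific value rather than a whole column, the same kind of XPath can carry a row predicate. The sketch below is an illustration and not part of the answer above: it assumes an xmllint build that supports `--xpath`, works on a trimmed copy of the sample table written to a temporary file, and the helper name `cell_for_name` is made up for this example. It selects the row whose 4th cell (Name) matches and prints the requested cell of that row.

```bash
# Trimmed copy of the question's table (one row per line), including the
# stray </a> tags, written to a temporary file for the demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table border="1">
<tr align="center"> <th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
</table>
EOF

# cell_for_name FILE NAME COLUMN   (hypothetical helper)
# Prints the COLUMN-th cell of the row whose 4th cell equals NAME.
# string(...) yields the text of the first matching cell only.
# NAME must not contain a single quote, because it is pasted into the XPath.
# 2>/dev/null hides the HTML parser's complaints about the unmatched </a>.
cell_for_name() {
    xmllint --html --xpath "string(//table/tr[td[4]='$2']/td[$3])" "$1" 2>/dev/null
}

cell_for_name "$sample" ecom-abpa-ear 8                    # Request Status column
cell_for_name "$sample" abpa_dynamic_config_properties 5   # Build # column
```

Because `string()` returns a single value, this form suits "give me the status of this one artifact" lookups. For whole columns, the `--shell` session shown above is still the simpler route.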

Quick and dirty:

cat your_html_file | perl -pe "s/^<\/?table.*$//g;s/^<tr .*$//g;s/<tr> (<td>.*?){8}//g;s/<th.*$//g;s/<\/.*$//g" | sed '/^$/d'
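If no HTML parser is available, a somewhat more readable variant of the same quick-and-dirty idea is to let awk split each row on `</td>`. This is only a sketch and shares the fragility of any pattern-based approach: it assumes every `<tr>` row sits on a single line (as in the trimmed sample below) and that cells are not nested. The function name `extract_column` and the temporary sample file are made up for the illustration.

```bash
# Trimmed copy of the question's table, one row per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table border="1">
<tr align="center"> <th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
</table>
EOF

# extract_column COLUMN FILE   (hypothetical helper)
# Splits each line on "</td>", so field N holds the N-th cell plus its
# opening tag; the remaining tags and surrounding blanks are then removed.
extract_column() {
    awk -v col="$1" -F'</td>' '
        (NF > col) && /<td/ {
            cell = $col
            gsub(/<[^>]*>/, "", cell)          # drop leftover tags such as <td> or </a>
            gsub(/^[ \t]+|[ \t]+$/, "", cell)  # trim leading and trailing blanks
            print cell
        }' "$2"
}

extract_column 8 "$sample"   # prints "Deployed" once per data row
```

The header row is skipped because it contains `<th>` cells only. As soon as the page starts wrapping rows across lines, this stops working, which is the usual argument for a real parser.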

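You could try the following wrapper:
To use it, you have to make your HTML more well-formed (I had to remove some markup to get the script to work).

You can also use my Xidel to get everything in the 8th column: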
xidel your_table.html -e '//table//tr/td[8]'

Or, if the column position can also change, look up the column number first:
xidel your_table.html -e 'column:=count(//table//th[.="Request Status"]/preceding-sibling::*)+1' -e '//table//tr/td[$column]'
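The same header lookup can be done with xmllint alone. This is a sketch under assumptions: an xmllint that supports `--xpath`, a trimmed copy of the sample table in a temporary file, and two illustrative helper names (`column_index`, `column_values`) that do not come from any of the answers. `count(...)+1` turns the number of header cells before the wanted one into a 1-based column number; the loop then prints that cell for each data row, one per line.

```bash
# Trimmed copy of the question's table, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table border="1">
<tr align="center"> <th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
<tr> <td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td> </tr>
</table>
EOF

# column_index FILE HEADER   (hypothetical helper)
# Prints the 1-based position of the <th> whose text equals HEADER.
# Caveat: if no header matches, count() is 0 and this prints 1.
# HEADER must not contain a double quote, because it is pasted into the XPath.
column_index() {
    xmllint --html --xpath "count(//table//th[.=\"$2\"]/preceding-sibling::*)+1" "$1" 2>/dev/null
}

# column_values FILE HEADER   (hypothetical helper)
# Prints that column's cell for every row that contains <td> cells.
column_values() {
    local col rows i
    col=$(column_index "$1" "$2")
    rows=$(xmllint --html --xpath 'count(//table//tr[td])' "$1" 2>/dev/null)
    i=1
    while [ "$i" -le "${rows:-0}" ]; do
        # string() returns one value per call, so each cell lands on its own line.
        xmllint --html --xpath "string((//table//tr[td])[$i]/td[$col])" "$1" 2>/dev/null
        i=$((i + 1))
    done
}

column_index  "$sample" "Request Status"
column_values "$sample" "Build #"
```

Calling xmllint once per row is slow on large tables. It is done here only to get one value per line regardless of how a given xmllint version separates multiple text nodes.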

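Note that the document is not well-formed (some opening tags are missing).

I like to use xmlstarlet for quick tests with simple, direct XPath.
Command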
xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n 

Explanation

sel   (or select)        - Select data (mode) or query XML document(s) (XPATH, etc)
-t or --template         - start a template
-m or --match <xpath>    - match XPATH expression
-v or --value-of <xpath> - print value of XPATH expression
-n or --nl               - print new line

Output

Deployed
Deployed
# plus empty-cell

You mentioned Perl, so use it. Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration. As soon as the HTML changes from what you expect, your code will break. For examples of how to parse HTML properly, look at the Perl modules that have already been written, tested, and debugged.

Really, don't use regexes.

Your document is not well-formed. Is that normal/expected, or a typo? Here is a well-formed version:

I can't view the well-formed version because my workplace blocks those links.

Looks like a nice concept/tool.