Parsing HTML tables into variables in Bash

html bash

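I am trying to solve this problem: I have some HTML source code and I want to extract the tables and their contents into variables. For example: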

<table>
content1
</table>
some more code
<table>
content2
</table>

I get:

<table>
content1
</table>


content1

There are no identifiers that distinguish these tables from one another. Do you know how to solve this?

Thanks

I will break down the answer I tried using xmllint, which supports a --html flag for parsing HTML files.

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}'
content1  content2

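Now for the value extraction part. The steps performed are as follows:

Start parsing the file from the root node down to the repeating node (//html/body/table), running xmllint in HTML parser mode with its interactive shell (xmllint --html --shell).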

Running that command on its own produces a result like this:

/ >  -------
<table>
content1
</table>

 -------
<table>
content2
</table>
/ >

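Next, the two sed stages drop the "/ >" prompt lines and strip the tags, and tr removes the newlines from the output above, so that awk can process the record using ------- as the field separator: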

content1 -------content2

Running the awk command on the above output then produces the result as needed:

awk -F"-------" '{print $1,$2}'

content1  content2
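The stages described above can be tried one at a time without xmllint installed. The following is only a sketch: the variable names (raw_output, stage1 to stage4) are made up for illustration, and raw_output is simply the sample shell output from above pasted into a string.

```bash
#!/bin/bash

# Sample text copied from the xmllint --shell output shown above.
raw_output='/ >  -------
<table>
content1
</table>

 -------
<table>
content2
</table>
/ > '

# Stage 1: delete every line that starts with the "/ >" shell prompt.
stage1=$(printf '%s\n' "$raw_output" | sed '/^\/ >/d')

# Stage 2: strip the tags (same expression as in the answer).
stage2=$(printf '%s\n' "$stage1" | sed 's/<[^>]*.//g')

# Stage 3: join everything into a single line.
stage3=$(printf '%s\n' "$stage2" | tr -d '\n')

# Stage 4: split on the ------- separator and print the first two fields.
stage4=$(printf '%s\n' "$stage3" | awk -F"-------" '{print $1,$2}')

printf 'after tr:  [%s]\n' "$stage3"
printf 'after awk: [%s]\n' "$stage4"
```

Printing the intermediate values in brackets makes stray whitespace visible, which helps when the real HTML is less tidy than this sample.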

Put together in a shell script, it looks like this:

#!/bin/bash

# extract table1 value
table1Val=$(echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1}')

# extract table2 value
table2Val=$(echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $2}')

# can be extended up-to any number of nodes
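When the number of tables is not known in advance, one possible variation is to collect the values into a Bash array instead of one variable per table. This is a sketch, not part of the original answer: parse_tables and the tables array are hypothetical names, the function expects the raw xmllint --shell output on stdin, and it assumes that no table content contains the ------- separator itself. As with the tr approach, the lines inside one table are joined without a separator.

```bash
#!/bin/bash

# parse_tables (hypothetical helper): reads raw `xmllint --shell` "cat" output
# on stdin and fills the global array `tables`, one element per node.
parse_tables() {
    tables=()
    local cleaned line current=""
    # Drop bare "/ >" prompt lines and strip the tags.
    cleaned=$(sed -e '/^\/ > *$/d' -e 's/<[^>]*>//g')
    while IFS= read -r line; do
        if [[ $line == *-------* ]]; then
            # A separator line closes the node collected so far.
            if [[ -n $current ]]; then tables+=("$current"); fi
            current=""
        else
            current+=$line
        fi
    done <<<"$cleaned"
    # Store the last node, which has no separator after it.
    if [[ -n $current ]]; then tables+=("$current"); fi
}

# Demo on the sample output shown earlier (no xmllint needed here).
parse_tables <<'EOF'
/ >  -------
<table>
content1
</table>

 -------
<table>
content2
</table>
/ > 
EOF

for i in "${!tables[@]}"; do
    printf 'table %d: %s\n' "$((i + 1))" "${tables[i]}"
done
```

With a real file, the here-document would be replaced by the xmllint command from the answer, for example parse_tables < <(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html).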


Or, quite simply:

#!/bin/bash


echo "cat //html/body/table" |  xmllint --html --shell file | sed '/^\/ >/d' | \
    sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}' | \
        while read -r value1 value2   # default IFS, so the line is split into two fields
        do
            # Do whatever with the values extracted
            printf '%s %s\n' "$value1" "$value2"
        done
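One thing to keep in mind with a loop on the right-hand side of a pipe: by default Bash runs it in a subshell, so variables assigned inside it are gone once the loop ends. If the values are needed later in the script, reading through process substitution is one way around that. The sketch below uses a hypothetical pipeline_output function as a stand-in for the xmllint | sed | tr | awk pipeline so that it runs without xmllint.

```bash
#!/bin/bash

# Stand-in for the xmllint | sed | tr | awk pipeline: prints its final output.
pipeline_output() {
    printf 'content1  content2\n'
}

# Variant 1: the loop is part of a pipeline, so it runs in a subshell.
first=""
pipeline_output | while read -r a b; do
    first=$a    # assigned in the subshell only
done

# Variant 2: process substitution keeps `read` in the current shell.
read -r value1 value2 < <(pipeline_output)

printf 'after the piped loop:       [%s]\n' "$first"
printf 'after process substitution: [%s] [%s]\n' "$value1" "$value2"
```

In a plain Bash script (without the lastpipe option) the first line prints an empty value, while value1 and value2 stay available for the rest of the script.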


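Note: the pipeline can be shortened and simplified by combining some of the awk and sed commands; this is just one working solution. The xmllint version I used is:

xmllint: using libxml version 20706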

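I have tried my best with regular expressions, but I don't know how to write a regex that can separate these tables. Sorry, lots of work...

This answer is great, but it is too specific to that example. I don't have an exact number of tables. Still, it helped me find a way to think about this problem, thanks.

@Majzlik: You can modify it to point at your nodes and extract the information accordingly. By the way, it is specific because that is the only file you provided.

Yes, I know, sorry, that wasn't meant as criticism. Thanks again for your answer.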
