How to parse HTML with sed: extract two strings, each delimited by two strings, on different lines, in order

string bash sed shell python perl

I have a bash script:

v1='value="'
v2='" type'

do_parse_html_file() {
   sed -n "s/.*${v1}//;s/${v2}.*//p" "${_SCRIPT_PATH}/IBlockListLists.html"|egrep '^http' >${_tmp_file}
}

...which extracts only the URLs from the HTML file. I would like to get output like this:

somename URL
somename URL

--- A sample of the input HTML file looks like this:

</tr>
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="bcoepfyewziejvcqyhqo" readonly="readonly" onclick="select_text('bcoepfyewziejvcqyhqo');" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="cslpybexmxyuacbyuvib" readonly="readonly" onclick="select_text('cslpybexmxyuacbyuvib');" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>


--- The result should look like this:

iana-reserved http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz
iana-private http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz

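--- Is it possible to do this with sed in a one-line command? If so, please help.

The first part of each pair, "somename", always comes first. The URL follows on a later line, which is not necessarily the very next one.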

>somename   ... is delimited by   'href="URL">'   and   '</a>'       on one line           
>URL ... is always delimited by   'value="'       and   '" type'     on any following line

Thank you,
Kind regards.
M.
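
The delimiter rules in the question can be sketched as a line-by-line scan: remember the most recent name, then emit it together with the next URL found. This is only a minimal illustration in Python, not a robust HTML parser. The function name `pair_names_with_urls` and the shortened sample input are assumptions made for the example, and it assumes a name and its URL never share a line.

```python
import re

# Delimiters taken from the question: the name sits between 'href="URL">'
# and '</a>'; the URL sits between 'value="' and '" type'.
NAME_RE = re.compile(r'href="[^"]*">([^<]*)</a>')
URL_RE = re.compile(r'value="([^"]*)" type')


def pair_names_with_urls(lines):
    """Yield 'name URL' strings, pairing each name with the next URL found.

    Hypothetical helper: assumes the URL is on a later line than its name.
    """
    name = None
    for line in lines:
        m = NAME_RE.search(line)
        if m:
            name = m.group(1)  # remember the most recent name
            continue
        m = URL_RE.search(line)
        if m and name is not None:
            yield f"{name} {m.group(1)}"
            name = None  # each name is paired with at most one URL


# Shortened version of one table row from the sample input.
sample = '''\
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><input id="bcoepfyewziejvcqyhqo" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
'''.splitlines()

for row in pair_names_with_urls(sample):
    print(row)
```

Like the sed approach, this leaves entities such as &amp; exactly as they appear in the file, and it breaks as soon as the markup changes shape.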

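sed is not the right tool for this.

I can show you scripts that do this with an HTML parser in another language (ruby, java, php, for example). Those are the right tools for the job.

This is probably the most discussed question on this site, and one of the people who built the site has written about it too.

Use a parser. There are many of them; here is an example using HTML::TokeParser.

Contents of script.pl: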

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TokeParser;

my $p = HTML::TokeParser->new( shift ) || die;

while ( my $tag = $p->get_tag( 'a' ) ) { 
    printf qq|%s %s\n|, $p->get_text, $tag->[1]{href};
}

Run it like this:

perl-5.14.2 script.pl htmlfile

This yields:

iana-reserved http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo
iana-private http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib
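
Note that this output pairs each name with the link's href. If the URL wanted is the one in the value attribute of the input element that follows, the same parser-based idea could be sketched in Python with the standard-library html.parser module. The class name `NameUrlParser` and the shortened sample are assumptions made for this illustration, and the parser decodes entities, so &amp; comes out as a plain &.

```python
from html.parser import HTMLParser


class NameUrlParser(HTMLParser):
    """Collect (link text, input value) pairs. Hypothetical helper class."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True while inside an <a> element
        self.name = None      # text of the most recent link, if unused
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.name = ""
        elif tag == "input" and self.name:
            value = dict(attrs).get("value")
            if value:
                self.pairs.append((self.name, value))
                self.name = None  # pair each name with one URL only

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.name += data


# Shortened version of one table row from the sample input.
sample = '''
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><input id="bcoepfyewziejvcqyhqo" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
'''

parser = NameUrlParser()
parser.feed(sample)
for name, url in parser.pairs:
    print(name, url)
```

Because the parser works on elements and attributes rather than on lines, it does not matter how many lines separate the link from the input field.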

With my Xidel, it is a one-liner:

xidel "${_SCRIPT_PATH}/IBlockListLists.html" -e '//a/concat(., " ", @href)'
