Linux command line: search all HTML files and extract attribute values

linux command-line sed

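I need to grab all the inline data-title attribute values from a bunch of HTML files located in different subdirectories. Is there a simple way to do this on a Linux machine?

I found something similar in another SO post and tried to adapt it, but I'm new to sed:

sed "s/.* data-title=\"\(.*\)\".*/\1/"

I can't quite get this part right, and I think I need an additional search tool to make it work. Ideally, I'd like all of the output saved to a txt file.

Sample: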

    <aside class="grid-sidebar sidebar">
        <div id="listLoading"><div id="loading-listLoading" class="front-center" style="padding-top: 22%; top: 0%; display: none;"><div style="width: 42px; height: 42px; position: absolute; margin-top: 17px; margin-left: -21px; -webkit-animation: spin12 0.8s linear infinite;"><svg style="width: 42px; height: 42px;"><g transform="translate(21,21)"><g stroke-width="4" stroke-linecap="round" stroke="rgb(34, 34, 34)"><line x1="0" y1="11" x2="0" y2="18" transform="rotate(0, 0, 0)" opacity="1"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(30, 0, 0)" opacity="0.9173553719008265"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(60, 0, 0)" opacity="0.8347107438016529"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(90, 0, 0)" opacity="0.7520661157024794"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(120, 0, 0)" opacity="0.6694214876033058"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(150, 0, 0)" opacity="0.5867768595041323"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(180, 0, 0)" opacity="0.5041322314049588"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(210, 0, 0)" opacity="0.42148760330578516"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(240, 0, 0)" opacity="0.33884297520661155"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(270, 0, 0)" opacity="0.25619834710743805"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(300, 0, 0)" opacity="0.17355371900826455"></line><line x1="0" y1="11" x2="0" y2="18" transform="rotate(330, 0, 0)" opacity="0.09090909090909094"></line></g></g></svg></div></div></div>
        <div id="list" style="position:relative;">
<div style="height: 55px;">
    <h2 class="heading" style="margin-bottom: 10px">Available Records</h2>
</div>
<div style="height: 51px">
            <div class="grid-3-4">
            <label for="searchInput" class="infield" style="position: absolute; left: 0px; top: 55px; display: block;">Search</label>
            <input id="searchInput" type="text" name="searchInput" data-title="title1" title="" style="height: 36px" class="input-long">
    </div>
    <div class="grid-1-4">
    <select id="listStatus" name="status" class="styled input-full hasCustomSelect" data-title="Title 2" title="" style="-webkit-appearance: menulist-button; width: 104px; position: absolute; opacity: 0; height: 36px; font-size: 16px;">
        <option value="all">All</option>
        <option value="active" selected="">Active</option>
        <option value="archived">Archived</option>
    </select><span class="customSelect styled input-full" style="display: inline-block;"><span class="customSelectInner" style="width: 100%; display: inline-block;">Active</span></span>
    </div>
</div>
    </aside>



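Yes, use xmllint (regular expressions are not the right tool for parsing HTML), passing it an XPath query via --xpath in which node is the name of the node that contains the title attribute.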

Edit

Neither xmllint nor xmlstarlet can parse this HTML correctly. A quick approach that works is:

grep -oP 'data-title="\K[^"]+' *files
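Because the files sit in several subdirectories and the output should end up in a text file, the one-liner above can be combined with find. A minimal sketch, assuming GNU grep built with PCRE support (needed for -P and \K); the function name extract_titles and the output file titles.txt are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: print every data-title value found in *.html files under a directory.
# extract_titles is a hypothetical helper name, not part of any tool.
extract_titles() {
    local root="${1:-.}"
    # -o prints only the matched text, -h suppresses file names,
    # -P enables Perl-style regexes so \K can drop the attribute prefix.
    find "$root" -type f -name '*.html' \
        -exec grep -ohP 'data-title="\K[^"]+' {} +
}

# Usage (titles.txt is an arbitrary output name):
#   extract_titles /path/to/site > titles.txt
```

Since [^"]+ needs at least one character, empty attributes such as data-title="" are skipped; changing + to * would print them as blank lines.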

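Alternatively, you can use (e)grep

grep -e '<title>.*</title>' *.html

egrep '<title>.*</title>' *.html

from within the folder.

Use

grep -re '<title>.*</title>' */*.html

to descend into subdirectories, and

grep -rhe '<title>.*</title>' */*.html

to descend into subdirectories and omit the file names from the output, if you only need the title lines.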

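If you prefer, you can use sed to pull out the contents of the title tag; if you need to get the value from some meta or link data instead, you will have to change the pattern:

sed -n 's#.*<title>\(.*\)</title>.*#\1#p' *.html

This only works when the opening and closing tags are on the same line. Otherwise, you need to modify it to match across multiple lines (which can still be done with sed).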
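One way to handle a title that spans several lines, sketched under the assumption that each file holds a single title element and is small enough to be treated as one long line, is to join the lines with tr before handing them to sed. The function name multiline_title is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: extract a <title> whose opening and closing tags sit on different lines.
# multiline_title is a made-up helper name; it expects one file path as argument.
multiline_title() {
    # Replace newlines with spaces so both tags end up on the same line,
    # apply the single-line substitution, then trim surrounding blanks.
    tr '\n' ' ' < "$1" |
        sed -n 's#.*<title>\(.*\)</title>.*#\1#p' |
        sed 's/^ *//; s/ *$//'
}

# Usage: multiline_title index.html
```

Because .* is greedy, a file with more than one title element would yield only the last one.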

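The OP said he has many subdirectories.

Regex is not meant for parsing HTML.

I know the question is about Linux. I just want to add that both xmllint commands fail on OS X.

What do you mean by "fail"? You can install xmllint.

When trying to run these, I get "Unknown option --xpath". Also, what if all the node names are different?

You should upgrade xmllint or use xmlstarlet.

OK, I have upgraded xmllint and now I get markup errors. I guess I should have mentioned that it's HTML5?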

sed -n "/title=/s/.* title=\"\(.*\)\".*/\1/p"
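A note on the sed attempts in this thread: .* is greedy, so on a line that holds several quoted attributes the capture runs to the last double quote on the line instead of stopping at the end of the data-title value. A minimal sketch of the difference, using a shortened, made-up input line and the hypothetical function names greedy_title and exact_title:

```shell
#!/usr/bin/env bash
# Hypothetical sample line, shortened from the HTML in the question.
line='<input data-title="title1" title="" class="x">'

# Greedy capture, as in the question: \(.*\) runs to the LAST quote on the line.
greedy_title() {
    sed -n 's/.* data-title="\(.*\)".*/\1/p'
}

# Negated character class: \([^"]*\) stops at the first closing quote.
exact_title() {
    sed -n 's/.* data-title="\([^"]*\)".*/\1/p'
}

printf '%s\n' "$line" | greedy_title   # prints more than the attribute value
printf '%s\n' "$line" | exact_title    # prints only the attribute value
```

Either form reports at most one value per line, so grep -o remains the simpler choice when a single line carries several data-title attributes.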