How can I restrict awk to search only items contained within a specific HTML tag?


I have an AWK script like this that I run on a file:

cat input.txt | awk 'gsub(/[^ ]*(fish|shark|whale)[^ ]*/,"(&)")' >> output.txt
This puts parentheses around "fish", "shark", or "whale" on every line that contains one of those words, for example:

The whale asked the shark to swim elsewhere.
The fish were unhappy.
After running it through the script, the file becomes:

The (whale) asked the (shark) to swim elsewhere.
The (fish) were unhappy.
The file is marked up with HTML, and I only need the substitution to happen between <b> tags:

The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
This becomes:

The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
  • Matching bold tags are never placed on different lines. The opening <b> tag always appears on the same line as its closing </b> tag.

How can I restrict awk's search so that it only finds and modifies text between <b> tags?

As long as the HTML markup is no worse than this, and the <b> spans contain no other HTML tags, it is relatively easy in Perl:

$ cat data
The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ perl -pe 's/(<b>[^<]*)\b(fish|shark|whale)\b([^<]*<\/b>)/\1(\2)\3/g' data
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ 

Here is a technique using awk:

awk '/<b>/{f=1}/<\/b>/{f=0}f{gsub(/fish|shark|whale/,"(&)")}1' RS=' ' ORS=' ' file
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.

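Read up on the match() function and its associated awk variables, RSTART and RLENGTH. Good luck. A UUOC (Useless Use of cat) award awaits you. Also, I note that with <b>if the shark and whale swim together</b>, only one of the two words ends up in parentheses. If that is a problem, you have to work harder. It can be done if necessary; that is an exercise for the reader!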
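
One possible way to follow the match() hint is sketched below. It is not taken from any of the answers: the function name bold_words is made up for illustration, and the sketch assumes, as the question states, that every <b> has its </b> on the same line and that spans contain no other tags. It walks each line with match(), uses RSTART and RLENGTH to cut out one <b>...</b> span at a time, and runs gsub() on that span only, so several words in the same span are all parenthesized. Like the awk one-liner above, it matches substrings, so dogfish would become dog(fish).

```shell
# bold_words: hypothetical helper name, not from the original thread.
# Reads the named files (or stdin) and parenthesizes fish/shark/whale
# only inside <b>...</b> spans that open and close on the same line.
bold_words() {
  awk '
  {
    out = ""
    rest = $0
    # match() sets RSTART and RLENGTH to the start and length of the next span.
    while (match(rest, /<b>[^<]*<\/b>/)) {
      span = substr(rest, RSTART, RLENGTH)
      gsub(/fish|shark|whale/, "(&)", span)        # edit the span only
      out = out substr(rest, 1, RSTART - 1) span   # text before the span is copied unchanged
      rest = substr(rest, RSTART + RLENGTH)        # continue after the span
    }
    print out rest
  }' "$@"
}
```

Called as bold_words input.txt >> output.txt (no cat needed), this should turn <b>if the shark and whale swim together</b> into <b>if the (shark) and (whale) swim together</b> while leaving text outside the bold tags alone.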