Bash: extract text between two strings on different lines

bash awk sed

I have a large email file that contains random hosts like the following:

......
HOSTS: test-host,host2.domain.com,
host3.domain.com,another-testing-host,host.domain.
com,host.anotherdomain.net,host2.anotherdomain.net,
another-local-host, TEST-HOST

DATE: August 11 2015 9:00
.......

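The hosts are always separated by commas, but they can be split across one, two, or more lines (unfortunately I have no control over this; it is what the email client does).

So I need to extract all the text between the string "HOSTS:" and the string "DATE:", unwrap it, and replace the commas with newlines, like this: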

test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST

So far I have come up with this, but I lose everything that is on the same line as "HOSTS:":

Help!

perl -ne '
    if (my $l = (/^HOSTS:/ .. /^DATE:/)) {
        chomp;
        s/^HOSTS:\s+// if 1 == $l;
        s/DATE:.*// if $l =~ /E/;
        s/,\s*/\n/g;
        print;
    }' input-file > output-file

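The flip-flop operator (..) returns a number that, in this case, is the line number within the current block. That makes it easy to remove HOSTS: from the first line (1 == $l). The last line can be recognized by the E0 appended to the number, which is how DATE:... is removed.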

Something like this might work for you:

sed -n '/HOSTS:/{:a;N;/DATE/!ba;s/[[:space:]]//g;s/,/\n/g;s/.*HOSTS:\|DATE.*//g;p}' "$file"

Breakdown:

-n                       # Suppress automatic printing
/HOSTS:/ {               # Match a line containing literal HOSTS:
  :a;                    # Label used for branching (goto)
  N;                     # Append the next line to the pattern space
  /DATE/!ba              # As long as literal DATE is not matched, go to :a
  s/[[:space:]]//g;      # Remove spaces and newlines
  s/,/\n/g;              # Replace each comma with a newline
  s/.*HOSTS:\|DATE.*//g; # Remove everything up to and including literal HOSTS:
                         # and everything from literal DATE onward
  p                      # Print the pattern space
}
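The one-liner relies on GNU sed extensions (\| in the regex, \n in the replacement, and everything packed onto one line). As a sketch only, the same loop can be spread over several lines, with the alternation split into two substitutions and the comma replacement handed to tr. The function name extract_hosts_sed is an assumption made for this example, and the variant is intended to be closer to POSIX sed rather than guaranteed to be portable.

```shell
# Collect the lines from HOSTS: through DATE into the pattern space,
# strip whitespace and the two markers, then let tr split on commas.
# extract_hosts_sed is a hypothetical helper name; $1 is the input file.
extract_hosts_sed() {
    sed -n '/HOSTS:/{
        :a
        N
        /DATE/!ba
        s/[[:space:]]//g
        s/.*HOSTS://
        s/DATE.*//
        p
    }' "$1" | tr ',' '\n'
}
```

Like the original, this assumes that no host name contains the text DATE and that the file holds a DATE line after HOSTS:.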

This awk one-liner may help:

awk -v RS='HOSTS: *|DATE:' 'NR==2{gsub(/[ \n]/,"");gsub(/,/,"\n");print}' input

Another approach, using awk with tr:

$ awk '/^HOSTS:/{$1="";p=1} /^DATE:/{p=0} p' file | tr -d ' \n' | tr ',' '\n'; echo ""

test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST

Here is another sed script that may work for you:

script.sed

/HOSTS:/,/DATE/ { 
    /DATE/! H;                        # append to HOLD space
    /DATE/ { g;                       # copy HOLD space into PATTERN space
             s/([\n ])|(HOSTS:)//g;   # remove unwanted strings
             s/,/\n/g;                # replace comma with newline
             p;                       # print
    }
}

Use it like this:

sed -nrf script.sed yourfile

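The middle block is applied to the lines between HOSTS: and DATE. Inside the block, lines that do not match DATE are appended to the hold space, and the line that matches DATE triggers the longer action.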
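To see what the hold space contains when the DATE line is reached, the following sketch replaces the clean-up commands with sed's l command, which prints the pattern space with newlines shown as \n and the end marked by $. The function name show_hold_space is hypothetical.

```shell
# Accumulate the block in the hold space exactly as script.sed does,
# then copy it back and print it unambiguously instead of cleaning it up.
# show_hold_space is a hypothetical helper name; $1 is the input file.
show_hold_space() {
    sed -n '/HOSTS:/,/DATE/{
        /DATE/!H
        /DATE/{
            g
            l
        }
    }' "$1"
}
```

For a short input whose block is "HOSTS: a,b," followed by "c", this prints \nHOSTS: a,b,\nc$. The leading \n comes from H, which always adds a newline before the appended line, and it is one reason the script removes newlines along with spaces.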

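Your bug is that // does not match an empty line as you think it does (I guess). Use /^$/d or /./!d. You will end up with more text than you want, but I guess you can take it from there...

You should mention that it is gawk-specific because of the multi-character RS.

UUOC. Always quote shell variables. Do not use all-uppercase variable names. You never need sed etc. when you are using awk. {if (A==1) print;} can be written simply as the condition alone. There is no need for spurious semicolons in awk. Always use single quotes around scripts (e.g. sed), not double quotes. \s is GNU sed-specific, so you should state that.

I am marking this as the correct answer. I tested it and it works, and I also like the explanation, but I will use my own approach, thanks to Jeff Y's suggestion.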
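Two of the points from these comments can be illustrated with a short sketch. In sed, an empty regex (//) reuses the most recently used regex instead of matching empty lines, so empty lines have to be matched explicitly. In awk, a condition with no action prints the record by default. All function names and the variable flag are made up for this example.

```shell
# Delete empty lines by matching them explicitly.
drop_empty_lines() {
    sed '/^$/d' "$1"
}

# Same result: delete every line that does not contain any character.
drop_empty_lines_alt() {
    sed '/./!d' "$1"
}

# A bare condition prints each record for which it is true, so no
# explicit if/print is needed. $1 is the flag value, $2 the input file.
print_when_flag_set() {
    awk -v flag="$1" 'flag == 1' "$2"
}
```

Calling print_when_flag_set 1 file prints the whole file, while print_when_flag_set 0 file prints nothing.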

awk 'sub(/^HOSTS: /,""){rec=""} /^DATE/{gsub(/ *, */,"\n",rec); print rec; exit} {rec = rec $0}' file
test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST