Regex: matching a regular expression across multiple lines in Linux bash


I have a large .warc file containing many records. I want to extract the headers from it in a bash script.

The file looks like this:

WARC/1.0
WARC-Type: response
Content-Length: 2597724
WARC-Date: 2016-05-07T03:36:46Z
WARC-Payload-Digest: sha1:33a3973a118293e4f8831449cc37095d645a57b3
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

WARC/1.0
WARC-Type: response
Content-Length: 2106841
WARC-Date: 2016-05-07T03:36:51Z
WARC-Payload-Digest: sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

etc...
I wrote a regular expression that matches this header:

REGULAR_EXPRESSION='WARC\/1\.0\nWARC-Type\:.*\nWARC-Date\:.*\nWARC-Payload-Digest:.*\nWARC-Target-URI:.*\nWARC-Record-ID:.*\n\n'
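I can't use grep with the -P option, so I don't know how to continue. Maybe sed? The next question, once the regular expression matches, is how to extract the relevant information.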


What is the best way to achieve this?

This is easier to handle with awk:

awk -F ': ' -v OFS='\t' 'NF>=2 {
   printf "%s%s", $2, ($1 != "WARC-Record-ID" ? OFS : ORS)}' file

response    2597724 2016-05-07T03:36:46Z    sha1:33a3973a118293e4f8831449cc37095d645a57b3   url application/http; msgtype=response  <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
response    2106841 2016-05-07T03:36:51Z    sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e   url application/http; msgtype=response  <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
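Since the goal is to pipe the extracted headers into another program, here is a minimal sketch of consuming the tab-separated output in a bash loop. The temporary sample file and the variable names (`sample`, `summary`, `type`, `length`, and so on) are assumptions made for illustration; the awk command itself is the one from the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical sample file that mirrors the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WARC/1.0
WARC-Type: response
Content-Length: 2597724
WARC-Date: 2016-05-07T03:36:46Z
WARC-Payload-Digest: sha1:33a3973a118293e4f8831449cc37095d645a57b3
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

WARC/1.0
WARC-Type: response
Content-Length: 2106841
WARC-Date: 2016-05-07T03:36:51Z
WARC-Payload-Digest: sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

EOF

# One tab-separated line per record, read back into named shell variables.
summary=$(
  awk -F ': ' -v OFS='\t' 'NF>=2 {
     printf "%s%s", $2, ($1 != "WARC-Record-ID" ? OFS : ORS)}' "$sample" |
  while IFS=$'\t' read -r type length date digest uri ctype id; do
    # Any downstream command could go here; this just prints three fields.
    printf '%s %s %s\n' "$id" "$date" "$length"
  done
)
printf '%s\n' "$summary"
rm -f "$sample"
```

Two caveats apply to this sketch. The `NF>=2` test also fires on any body line that happens to contain a colon followed by a space, so real HTML payloads may produce false matches. Because tab is a whitespace character in `IFS`, `read` collapses consecutive tabs, so an empty header value would shift the remaining fields.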
Another awk solution:

awk -F': ' '/WARC-Type/{n=NR+6}NR<=n{ s="\t"; if(NR==n){n=0;s=ORS} printf "%s%s",$2,s }' file
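The command above relies on every header having exactly seven `Name: value` lines starting at `WARC-Type`. If that count cannot be guaranteed, one possible variation is to track state instead: start collecting at the `WARC/x.y` line and stop at the first blank line. This is a sketch rather than the answerer's method; the names `in_header` and `sep` and the sample file are assumptions, and it presumes a blank line follows each header block as in the question's sample.

```shell
#!/usr/bin/env bash
# Hypothetical sample file that mirrors the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WARC/1.0
WARC-Type: response
Content-Length: 2597724
WARC-Date: 2016-05-07T03:36:46Z
WARC-Payload-Digest: sha1:33a3973a118293e4f8831449cc37095d645a57b3
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

WARC/1.0
WARC-Type: response
Content-Length: 2106841
WARC-Date: 2016-05-07T03:36:51Z
WARC-Payload-Digest: sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

EOF

headers=$(awk -F': ' '
  { sub(/\r$/, "") }                                          # tolerate CRLF line endings
  /^WARC\/[0-9.]+$/     { in_header = 1; sep = ""; next }     # version line starts a header
  in_header && $0 == "" { in_header = 0; printf "\n"; next }  # blank line ends the header
  in_header             { printf "%s%s", sep, $2; sep = "\t" } # emit the value after ": "
' "$sample")
printf '%s\n' "$headers"
rm -f "$sample"
```

Because body lines are skipped while `in_header` is 0, text inside the HTML payload that happens to contain `WARC-Type` or a colon does not leak into the output, unless a payload line itself looks exactly like a `WARC/x.y` version line.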

Your question is very unclear, and your example tells us nothing that would help avoid false matches (the hardest part of any matching script), but is this what you are trying to do?

$ awk -v RS= -v FS='\n[^:]+: *' -v OFS='\t' 'sub(/^WARC\/[0-9.]+/,""){$1=$1; sub(OFS,""); print}' file
response        2597724 2016-05-07T03:36:46Z    sha1:33a3973a118293e4f8831449cc37095d645a57b3   url     application/http; msgtype=response      <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
response        2106841 2016-05-07T03:36:51Z    sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e   url     application/http; msgtype=response      <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
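The command above uses awk's paragraph mode: an empty `RS` makes each blank-line-separated block one record. Building on the same idea, the following sketch looks headers up by name instead of by position, which can help when only a few fields are needed downstream. The `hdr` array, the choice of `WARC-Record-ID` and `Content-Length`, and the sample file are assumptions for illustration; it also presumes LF line endings as in the question's sample.

```shell
#!/usr/bin/env bash
# Hypothetical sample file that mirrors the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WARC/1.0
WARC-Type: response
Content-Length: 2597724
WARC-Date: 2016-05-07T03:36:46Z
WARC-Payload-Digest: sha1:33a3973a118293e4f8831449cc37095d645a57b3
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

WARC/1.0
WARC-Type: response
Content-Length: 2106841
WARC-Date: 2016-05-07T03:36:51Z
WARC-Payload-Digest: sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>

<!DOCTYPE html>
//some html code

EOF

# RS= turns on paragraph mode; -F'\n' makes every line of a block one field.
picked=$(awk -v RS= -F'\n' -v OFS='\t' '
  /^WARC\/[0-9.]+/ {                 # only blocks that start with a version line
    split("", hdr)                   # portable way to clear the array
    for (i = 2; i <= NF; i++) {
      p = index($i, ": ")            # first ": " separates name from value
      if (p) hdr[substr($i, 1, p - 1)] = substr($i, p + 2)
    }
    print hdr["WARC-Record-ID"], hdr["Content-Length"]
  }' "$sample")
printf '%s\n' "$picked"
rm -f "$sample"
```

Looking values up by header name means the output does not change if a record lists its headers in a different order, which the positional versions would not tolerate.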

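@anubhava Sorry for the unclear information. I have updated my question; it should be clearer now. Thanks for your help.

Your question is an XY problem. Tell us the problem you are trying to solve rather than asking for help with a solution that does not seem to be the right approach.

@JarrodRoberson I don't think so. As I wrote, I can't use grep with the -P option, so I don't know how to continue. I need to match my regex and then pipe (|) the result to another program. I don't need the complete solution; I only need to know which programs I should use.

@Joozty: Does this work for you?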