Linux: formatting specific data with awk or sed

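I am currently working with large datasets that contain file information formatted as blocks of data. I am trying to take a piece of data from the file path line and append it as a new column to certain lines. The dataset contains file information in the following format: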

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
b9:45:3d:f4:97:88               1849                    10
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10
19:c5:b2:aa:b3:60               613                     10
11:7c:7e:76:4b:d5               1272                    10
36:e0:59:49:b6:4a               581                     10
9c:31:bc:8a:39:94               3296                    10
01:f0:56:3a:e1:a9               1140                    10
Whole File Hash: 4b28b44ae03d
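What I want to do is grab the file type (.jar and .c in this example) and append it to the respective chunk hash lines, so that the final format looks like this: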

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d
I already have awk code for extracting the file type and the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'
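I just can't figure out how to use awk (or sed) to join these pieces so that the file type is appended as a new column to every line in its respective data block. One other thing to note, in case it makes a difference: I am trying to do this inside a bash script.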

A solution in the TXR language:

@(repeat)
@  (cases)
File path: @*path.@suff
Inode Num: @inode
@header
@    (collect)
@hashline
@    (last)
Whole File Hash: @wfh
@    (end)
@    (output)
File path: @path.@suff
Inode Num: @inode
@header
@      (repeat)
@{hashline 88}.@suff
@      (end)
Whole File Hash: @wfh
@    (end)
@  (or)
@other
@  (do (put-line other))
@  (end)
@(end)

Here is a (GNU) sed solution:

/File path:/ {         # If line matches "File path:"
    h                  # Copy pattern space to hold space
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space
    x                  # Swap pattern space and hold space
}                      # Hold space now contains extension
/Chunk Hash/ {         # If line matches "Chunk Hash"
    n                  # Get next line into pattern space
    :loop              # Anchor for loop
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop
    G                  # Append extension from hold space to pattern space
    s/\n/\t\t\t\t/     # Substitute newline with a bunch of tabs
    n                  # Get next line
    b loop             # Jump back to ":loop" label
}
This can be stored in a separate file (say, so.sed) and must be invoked like this:

sed -r -f so.sed infile
which results in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d
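A non-GNU sed has to jump through hoops to insert the tabs and cannot use the -r option (though probably -E, which should be equivalent here; -r is only a convenience that avoids having to escape the ()).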
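
The portability note above can be sketched as follows. This is only an illustration, assuming a POSIX-style sed: the function name append_ext_posix is made up for this example, the group is written with escaped \( \) so that neither -r nor -E is needed, and the tabs are produced by printf in the shell instead of relying on \t in the replacement.

```shell
#!/bin/sh
# append_ext_posix: hypothetical helper name, not part of the original answer.
# Reads the files given as arguments (or stdin) and appends the file
# extension to every chunk line, using only basic regular expressions.
append_ext_posix() {
    # Four literal tab characters; command substitution keeps them intact.
    tabs=$(printf '\t\t\t\t')
    sed '
/File path:/ {
    h
    s/.*\(\.[^.]*\)$/\1/
    x
}
/Chunk Hash/ {
    n
    :loop
    /Whole File Hash/b
    G
    s/\n/'"$tabs"'/
    n
    b loop
}' "$@"
}
```

It would be called as `append_ext_posix infile`, or at the end of a pipeline without arguments. The logic is the same as in the GNU version: the hold space carries the extension from the "File path:" line down to the chunk lines.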

In awk:

$ cat script.awk
/File path/ { 
    match($0,/\.[^.\/]*$/)   # match from the last dot, so dots in directory names are ignored
    ext=substr($0,RSTART,RLENGTH)
} 
/Chunk Hash/ {
    flag=1            # flag on
    print             # print here to...
    next              # avoid printing ext
} 
/Whole File Hash:/ {  
    flag=0            # flag off
} 
flag==1 {
    print $0, ext     # add space here to your liking, left it short...
    next              # ... to show output on screen without sidescrolling
} 1                   # print non-flagged records
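
Since the question mentions doing this inside a bash script, here is a minimal sketch of how the same awk logic might be wrapped in a shell function with an aligned extension column. The function name append_ext, the variable in_chunks, and the default width of 88 are assumptions made for this example, not part of the answer above.

```shell
#!/bin/sh
# append_ext: hypothetical wrapper name. Reads stdin and pads every chunk
# line to a fixed width (first argument, assumed default 88) before
# appending the extension taken from the preceding "File path:" line.
append_ext() {
    awk -v width="${1:-88}" '
        BEGIN { fmt = "%-" width "s%s\n" }   # left-justified, padded line + extension
        /^File path:/ {
            # Text from the last dot to the end of the line, or empty if none.
            ext = match($0, /\.[^.\/]*$/) ? substr($0, RSTART) : ""
        }
        /^Chunk Hash/       { in_chunks = 1; print; next }   # header row is printed unchanged
        /^Whole File Hash:/ { in_chunks = 0 }
        in_chunks           { printf fmt, $0, ext; next }
        { print }                                            # everything else as-is
    '
}
```

A call such as `append_ext 88 < infile > outfile` would then fit into a larger script. Building the printf format in BEGIN avoids relying on the `*` width specifier, which not every awk implementation supports.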

Some lines are doubled; you should remove the p command from the address range block.
@Kenavoz Whoa, yes, n prints when there is no -n option... Thanks!
Sorry, my English is poor and it is hard for me to express myself. I will try to write some explanation.