Shell: Quick lookup of structured text data from the command line?

shell awk data-structures grep


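Suppose I have a predictable text document made up of items, each identified by an `X:` ID and carrying a known combination of properties such as a category `Y:`, with a known number of instances (for example, there is always exactly one `Y:` after each `X:` in the serialization).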

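I would like to retrieve a list of the item IDs of all blue things. I don't care about duplicate IDs, only about which ID values occur in the document. I then want to sort the list and compare it with the list of blue-thing IDs from another structured text document with exactly the same structure ("Which blue things do both documents have in common?", "Which blue things are in document 1 but not in document 2?").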

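I know I can very easily `grep` for all `Y:BLUE` lines, but what additional commands do I need to find the preceding `X:` for each such instance, and to pass the sorted result list to `diff`? I haven't used the command line much since AmiShell... sorry :( Is there a cookbook for use cases like this somewhere online?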

Let's consider the following 2 input documents:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can use the following `awk` command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2
1
3
4

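Explanation:

`-F':'` defines `:` as the field separator.

`/X:[0-9]+$/{tmp=$2}` saves the value of the ID in the `tmp` variable (assuming the ID consists only of digits and nothing else follows it on the line). If that does not fit your case, adjust the filtering regex `/X:[0-9]+$/` to suit your needs.

`/Y:BLUE$/{a[NR]=tmp}`: when we reach a line matching the pattern `Y:BLUE` (assuming the end of line comes right after `BLUE`), we add the value saved in `tmp` to the array.

At the end of processing we sort the array (with `asort`, a gawk extension) and print it. Note that you could instead change the awk command to `awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}'` and pipe its output through `sort -n`.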
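
The print-and-sort variant mentioned in the explanation can be wrapped in a small function. This is only a sketch: the name `blue_ids` is made up for illustration, and it relies on the same assumptions as the answer (numeric IDs, nothing after the ID or after `BLUE` on the line). The `-u` flag is added because the question only cares about which ID values occur, not how often.

```shell
# blue_ids FILE: print the distinct IDs of BLUE items in FILE, sorted numerically.
# Hypothetical helper name; uses only plain awk and sort, no gawk extensions.
blue_ids() {
    awk -F':' '
        /X:[0-9]+$/ { id = $2 }    # remember the most recent ID
        /Y:BLUE$/   { print id }   # emit it once its category turns out to be BLUE
    ' "$1" | sort -n -u
}
```

With the sample files above, `blue_ids doc1` should list the same IDs as the `asort` version.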

You can then combine them as follows to find the differences in blue IDs between the two documents:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
2c2
< 2
---
> 3

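Can you post some more of this text document? I don't quite understand its format. Maybe you want `grep -A1 Y:BLUE`?

Something like `grep -E "^X:|^Y:BLUE" | grep -B1 "^Y:BLUE"`?

Your question should include concise, testable sample input and expected output covering your use case so that we can try to help you.

This sounds like exactly what Awk was designed for, although if speed matters you will want a database with indexes.

Show the actual data you want to operate on: sample input, expected output, clear boundary conditions, no hand-waving. If you haven't tried to solve it in code yourself, it may still be too broad.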
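
The two-step `grep` idea from the comments can be sketched as follows. The function name `blue_ids_grep` is hypothetical, and the patterns allow leading whitespace because the sample lines are indented. It assumes, as the question states, exactly one `Y:` line per preceding `X:` line, and it uses `grep -B1`, which GNU and BSD grep provide but POSIX does not require.

```shell
# blue_ids_grep FILE: grep/sed take on the same lookup (hypothetical helper name).
blue_ids_grep() {
    grep -E '^[[:space:]]*[XY]:' "$1" |    # keep only the X: and Y: lines
        grep -B1 'Y:BLUE$' |               # each BLUE line plus the line just before it
        sed -n 's/^[[:space:]]*X:\([0-9][0-9]*\)$/\1/p' |   # extract the ID from the X: lines
        sort -n -u
}
```

Once only the `X:` and `Y:` lines are left, every `Y:BLUE` line sits directly below its own `X:` line, which is why one line of leading context is enough.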

To list the blue IDs that both documents have in common:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
1
4
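
The `comm -1 -2` call above answers "which blue things do both documents have in common?". The other question, "which are in document 1 but not in document 2?", can be handled the same way with `comm -23`. The sketch below is one possible arrangement, not the answer's own method: all function names are hypothetical, it uses a temporary file instead of `<( ... )` so that it also runs in a plain POSIX shell, and it sorts in byte order under `LC_ALL=C` because `comm` expects input sorted in its own collation order (a numeric sort only happens to agree with that while every ID has the same number of digits).

```shell
# blue_ids_lex FILE: distinct blue IDs, sorted in byte order for comm (hypothetical helper).
blue_ids_lex() {
    awk -F':' '/X:[0-9]+$/{id=$2} /Y:BLUE$/{print id}' "$1" | LC_ALL=C sort -u
}

# compare_blue FLAGS FILE1 FILE2: run comm with FLAGS on the two ID lists.
compare_blue() {
    ids2=$(mktemp) || return 1             # temp file holding the second list
    blue_ids_lex "$3" > "$ids2"
    blue_ids_lex "$2" | LC_ALL=C comm "$1" - "$ids2"   # "-" makes comm read the first list from stdin
    rm -f "$ids2"
}

blue_in_both()       { compare_blue -12 "$1" "$2"; }   # IDs that are blue in both files
blue_only_in_first() { compare_blue -23 "$1" "$2"; }   # blue in FILE1, not blue in FILE2
```

For the sample documents, `blue_only_in_first doc1 doc2` should print `2`, matching the `< 2` line in the `diff` output.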