Shell: How can I quickly look up structured text data from the command line?


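Suppose I have a predictable text document made up of IDs named X: together with a known set of attributes, such as a category Y:, each with a known number of instances (for example, every X: in the sequence is always followed by exactly one Y:).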

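I want to retrieve the list of item IDs for all blue items. I don't care whether an ID is duplicated, only which ID values occur in the document. I then want to sort the list and compare it with the list of blue item IDs from another structured text document with exactly the same layout ("Which blue items do both documents have in common?" "Which blue items are in document 1 but not in document 2?").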


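I know I can very easily grep for all the Y:BLUE lines, but for each such match, what additional commands do I need to find the preceding X: and pass the sorted list of results to diff? I haven't used the command line much since AmiShell, sorry :( Is there a cookbook online for use cases like this?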

Let's consider the following two input documents:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
You can use the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2
1
3
4
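
Explanation:

  • -F':' sets ":" as the field separator.
  • /X:[0-9]+$/{tmp=$2} saves the ID in the variable tmp. This assumes the ID consists only of digits and that nothing else follows it on the line; if that does not hold for your data, adjust the filtering regex /X:[0-9]+$/ to suit your needs.
  • /Y:BLUE$/{a[NR]=tmp} runs when we reach a line matching the pattern Y:BLUE (assuming the end of line comes right after BLUE) and adds the value saved in tmp to the array.
  • At the end of processing, we sort the array and print it. asort is a GNU awk (gawk) extension, so this step needs gawk. Note that you can instead change the awk command to print each ID directly and pipe the result to sort: awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' | sort -n
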
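
The print-and-sort variant described in the last step can be sketched as follows. This is a minimal illustration, not the answer's exact command: the file name sample.txt and its contents are made up for the example, and the -u flag is an addition, chosen because the question says duplicate IDs do not matter.

```shell
#!/bin/sh
# Build a tiny sample in the same X:/Y: layout (hypothetical file name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
  X:4
# more data pertaining to the item
  Y:BLUE
  X:3
  Y:RED
  X:1
# more data pertaining to the item
  Y:BLUE
EOF

# Remember the most recent ID; print it when the following Y: line is BLUE.
# sort -n orders the IDs numerically and -u drops repeated IDs.
ids=$(awk -F':' '/X:[0-9]+$/{tmp=$2} /Y:BLUE$/{print tmp}' "$tmpdir/sample.txt" | sort -n -u)
printf '%s\n' "$ids"

rm -rf "$tmpdir"
```

Because the sorting is left to sort, this form does not depend on asort and should run with any POSIX awk.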
You can then combine them as follows to find the differences in blue IDs between the two documents:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
2c2
< 2
---
> 3

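Can you post more of this text document? I can't quite make out its format.
Maybe you want grep -A1 Y:BLUE?
Something like grep -E "^X:|^Y:BLUE" | grep -B1 "^Y:BLUE"?
Your question should include concise, testable sample input and the expected output covering your use case, so that we can try to help you.
This sounds like exactly what Awk was designed for, although if speed matters you will want a database with an index.
Show the actual data you want to operate on: sample input, expected output, clear boundary conditions, no hand-waving. If you haven't tried to solve it in code yourself, it is probably still too broad.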
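
The grep idea raised in the comments can be sketched like this. It is only an illustration under assumptions: the sample file is invented, the patterns are anchored at the end of the line rather than the start because the sample data is indented, and the -B option is available in GNU and BSD grep but is not required by POSIX. It also relies on every X: being followed by exactly one Y:.

```shell
#!/bin/sh
# Small stand-in for doc1/doc2 (hypothetical file, same X:/Y: layout).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
  X:4
# more data pertaining to the item
  Y:BLUE
  X:3
  Y:RED
  X:1
  Y:BLUE
EOF

# 1. keep only the ID lines and the blue category lines
# 2. for each blue line, also keep the line just before it (its X: line)
# 3. keep the X: lines only, strip the prefix, sort numerically without repeats
ids=$(grep -E 'X:[0-9]+$|Y:BLUE$' "$tmpdir/sample.txt" \
    | grep -B1 'Y:BLUE$' \
    | grep 'X:' \
    | sed 's/.*X://' \
    | sort -n -u)
printf '%s\n' "$ids"

rm -rf "$tmpdir"
```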

To list the blue IDs that the two documents have in common, use comm:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
1
4
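
The question also asks which blue items are in document 1 but not in document 2. A possible sketch using comm for all three set operations follows. The helper name blue_ids is hypothetical, the two files are reduced stand-ins for doc1 and doc2, and temporary files replace the process substitution so the script runs in plain sh. The IDs are sorted as text rather than with sort -n, since comm expects its input in lexical order.

```shell
#!/bin/sh
# blue_ids is a hypothetical helper: print the unique blue IDs of one file,
# sorted as text (comm expects lexically sorted input, not sort -n order).
blue_ids() {
    awk -F':' '/X:[0-9]+$/{tmp=$2} /Y:BLUE$/{print tmp}' "$1" | sort -u
}

tmpdir=$(mktemp -d)
# Reduced stand-ins for doc1 and doc2 (same X:/Y: layout, fewer lines).
printf '  X:1\n  Y:BLUE\n  X:2\n  Y:BLUE\n  X:3\n  Y:RED\n  X:4\n  Y:BLUE\n' > "$tmpdir/doc1"
printf '  X:4\n  Y:BLUE\n  X:3\n  Y:BLUE\n  X:2\n  Y:RED\n  X:1\n  Y:BLUE\n' > "$tmpdir/doc2"

blue_ids "$tmpdir/doc1" > "$tmpdir/ids1"
blue_ids "$tmpdir/doc2" > "$tmpdir/ids2"

common=$(comm -12 "$tmpdir/ids1" "$tmpdir/ids2")  # blue in both documents
only1=$(comm -23 "$tmpdir/ids1" "$tmpdir/ids2")   # blue in doc1 only
only2=$(comm -13 "$tmpdir/ids1" "$tmpdir/ids2")   # blue in doc2 only

printf 'blue in both:\n%s\n' "$common"
printf 'blue in doc1 only:\n%s\n' "$only1"
printf 'blue in doc2 only:\n%s\n' "$only2"

rm -rf "$tmpdir"
```

With comm, each suppressed column is named by its number, so -23 leaves only the lines unique to the first file and -13 only those unique to the second.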