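Shell: quickly looking up structured text data from the command line?
Tags: shell, awk, data-structures, grep

Suppose I have a predictable text document made up of IDs labelled `X:` and known combinations of attributes, such as a category `Y:`, with a known number of instances (for example, every `X:` in the sequence is always followed by exactly one `Y:`).

I want to retrieve the list of item IDs for everything that is blue. I don't care whether an ID is duplicated, only which ID values occur in the document. I then want to sort the list and compare it with the list of blue IDs from another text document that has exactly the same structure ("Which blue items do both documents have in common?" "Which blue items are in document 1 but not in document 2?").

I know I can very easily `grep` for all `Y:BLUE` lines, but for each such hit, what additional commands do I need to find the preceding `X:` and pass the sorted result list to `diff`? I haven't used the command line much since AmiShell... sorry :( Is there a cookbook online for use cases like this?

Let's consider the following 2 input documents: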
$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
X:1
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:4
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
X:4
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:1
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
You can use the following `awk` command on each document to get the IDs:
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
4
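Explanation:
- `-F':'` defines `:` as the field separator.
- `/X:[0-9]+$/{tmp=$2}` saves the ID value in the variable `tmp` (assuming the ID consists only of digits and nothing else follows it on the line). If that does not fit your data, adjust the filter regex `/X:[0-9]+$/` to your needs.
- `/Y:BLUE$/{a[NR]=tmp}`: when we reach a line matching the pattern `Y:BLUE` (assumption: the end of line comes right after `BLUE`), we add the value saved in `tmp` to the array.
- At the end of processing, we sort the array and print it (`asort` is a GNU awk extension). Note that you can instead shorten the `awk` command to `awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}'` and pipe its output into `sort -n`.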
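Because the question only needs the distinct ID values, sorting and de-duplication can also be left to `sort -u`, which avoids the GNU-specific `asort`. The following is a minimal sketch, not the answer's own method. It assumes every ID line has the form `X:<id>` and every blue line is exactly `Y:BLUE`; the function name `blue_ids` is made up for illustration.

```shell
# blue_ids FILE: print the distinct IDs whose Y: attribute is BLUE, one per line.
# Assumes every "X:<id>" line precedes the "Y:" line that belongs to it.
blue_ids() {
    awk -F':' '
        /^X:/      { id = $2 }   # remember the most recent ID
        /^Y:BLUE$/ { print id }  # emit it when its category is BLUE
    ' "$1" | sort -u
}
```

Calling `blue_ids doc1` on the sample above prints `1`, `2` and `4`. Note that `sort -u` orders the IDs as text rather than numerically.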
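To compare the two documents, pass both ID lists to `diff` using process substitution (available in bash, zsh and ksh):
$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)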
2c2
< 2
---
> 3
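Comments:
- Could you post a bit more of this text document? I don't quite understand its format. Maybe you want `grep -A1 Y:BLUE`?
- Something like `grep -E "^X:|^Y:BLUE" | grep -B1 "^Y:BLUE"`?
- Please edit your question to include concise, testable sample input and the expected output covering your use case, so that we can try to help you.
- This sounds like exactly what Awk was designed for, although if speed matters you will want a database with an index.
- Show the actual data you want to operate on: sample input, expected output, clear boundary conditions, no hand-waving. If you haven't tried to solve it in code yourself, it may still be too broad.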
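The two-stage `grep` idea from the comments can be completed into a full pipeline. This is only a sketch: it assumes a `grep` that supports the `-B` context option (GNU and BSD grep both do) and relies on each `X:` line being followed by exactly one `Y:` line. The helper name `blue_ids_grep` is hypothetical.

```shell
# blue_ids_grep FILE: distinct IDs of BLUE items, using grep context lines.
blue_ids_grep() {
    grep -E '^X:|^Y:BLUE$' "$1" |   # keep only ID lines and BLUE lines
        grep -B1 '^Y:BLUE$' |       # each BLUE line plus the line before it
        grep '^X:' |                # drop the BLUE lines and "--" separators
        cut -d: -f2 |               # keep the ID after the colon
        sort -u
}
```

After the first filter, the line directly before each `Y:BLUE` is the `X:` line of the same item, so one line of leading context is enough.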
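To list only the IDs that are blue in both documents, use `comm -1 -2`, which suppresses the lines unique to either input:
$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)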
1
4
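The question also asks which blue IDs occur in document 1 but not in document 2. `comm` can answer that too: `-23` suppresses columns 2 and 3, leaving only the lines unique to the first file. The sketch below uses temporary files instead of process substitution so that it should also run in a plain POSIX shell. The function names `blue_ids` and `only_in_first` are hypothetical. Because `comm` expects its inputs in collation order, the lists are sorted with `sort -u` rather than numerically.

```shell
# Distinct BLUE IDs of one file, sorted the way comm expects.
blue_ids() {
    awk -F':' '/^X:/{id=$2} /^Y:BLUE$/{print id}' "$1" | sort -u
}

# only_in_first FILE1 FILE2: BLUE IDs present in FILE1 but absent from FILE2.
only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    blue_ids "$1" > "$a"
    blue_ids "$2" > "$b"
    comm -23 "$a" "$b"      # column 1 only: lines unique to the first list
    rm -f "$a" "$b"
}
```

For the sample documents, `only_in_first doc1 doc2` prints `2`; swapping the arguments lists the IDs that are blue only in the second document.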