awk: parse a file, compare data with the next line, and print the result in CSV format
I have the following input, which is space-separated. The first column is the timestamp and the next column is the thread ID. I want to convert the output to a CSV file. Sample input:
04/09/15,08:49:05.001210 [Dispatch#3 (0x1b3b738)] NOTI
04/09/15,08:49:05.118592 [Dispatch#0 (0x1b3b708)] NOTI
04/09/15,08:49:05.225846 [Dispatch#2 (0x1b3b728)] NOTI
04/09/15,08:49:05.361914 [Dispatch#1 (0x1b3b718)] NOTI
04/09/15,08:49:05.469372 [Dispatch#3 (0x1b3b738)] NOTI
04/09/15,08:49:05.569784 [Dispatch#0 (0x1b3b708)] NOTI
04/09/15,08:49:05.738324 [Dispatch#2 (0x1b3b728)] NOTI
04/09/15,08:49:05.851328 [Dispatch#1 (0x1b3b718)] NOTI
04/09/15,08:49:05.965042 [Dispatch#3 (0x1b3b738)] NOTI
04/09/15,08:49:06.041505 [Dispatch#0 (0x1b3b708)] NOTI
04/09/15,08:49:06.151353 [Dispatch#2 (0x1b3b728)] NOTI
04/09/15,08:49:07.814024 [Dispatch#1 (0xb29718)] NOTI
04/09/15,08:49:07.588469 [Dispatch#1 (0xb29718)] NOTI
04/09/15,08:49:07.371815 [Dispatch#0 (0xb29708)] NOTI
04/09/15,08:49:07.160045 [Dispatch#0 (0xb29708)] NOTI
04/09/15,08:49:07.979571 [Dispatch#0 (0xb29708)] NOTI
04/09/15,08:50:08.385921 [Dispatch#0 (0x120e708)] NOTI
04/09/15,08:50:08.450522 [Dispatch#3 (0x120e738)] NOTI
04/09/15,08:50:08.550118 [Dispatch#1 (0x120e718)] NOTI
04/09/15,08:50:08.600923 [Dispatch#0 (0x120e708)] NOTI
Output in CSV format:
TimeStamp,Thread1,Thread2,Thread3,Thread4
04/09/15 08:49:05,2,2,2,3
04/09/15 08:49:06,1,0,1,0
04/09/15 08:49:07,3,2,0,0
04/09/15 08:49:08,2,1,0,1
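So I want to print the number of records each thread processed at a given time.
For example, in the sample above, at 04/09/15 08:49:07 thread 1 (0x1b3b718) has 3 records and thread 2 (0xb29718) has 2 records, while threads 3 and 4 have no records.
Please suggest whether this information can be obtained with an awk command.

If I understand correctly what you are trying to do, then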
awk -F '[,.# ]+' -v OFS=, 'function ts() { return $1 " " $2 } function dump() { print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0 } BEGIN { print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4" } ts() != saved { if(NR != 1) dump(); delete a; saved = ts() } { ++a[$5] } END { dump() }' filename
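is a somewhat crude way to do it.
The trick is to use the field separator regex '[,.# ]+' (note the space inside the brackets), which splits each line so that the timestamp is in fields 1 and 2 and the thread number is in field 5. The -v OFS=, option sets the output field separator to a comma, so the output is CSV. Then: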
function ts() {       # function to build a full timestamp as it is printed
  return $1 " " $2    # later
}
function dump() {     # function to print a result line. The +0 is to force
                      # the fields to be numbers, in case one remained empty.
  print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0
}
BEGIN {               # in the beginning, print the header line.
  print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4"
}
ts() != saved {       # if the timestamp changed:
  if(NR != 1) dump()  # if we're not just starting, print the result for
                      # the last block
  delete a            # discard counters
  saved = ts()        # save new timestamp
}
{ ++a[$5] }           # increase the counter for the thread this line mentions
END { dump() }        # and in the end, print the result for the last block.
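To check the field numbering before running the full command, it can help to print the fields that the separator produces for a single line. This is only a minimal sketch, assuming a POSIX shell and any standard awk; the variable names line and fields are made up for the illustration, and the sample line is taken from the input above.

```shell
# Split one sample line with the same field separator as the command above
# and list every resulting field together with its number.
line='04/09/15,08:49:05.001210 [Dispatch#3 (0x1b3b738)] NOTI'
fields=$(printf '%s\n' "$line" |
  awk -F '[,.# ]+' '{ for (i = 1; i <= NF; ++i) print i ": " $i }')
printf '%s\n' "$fields"
```

Fields 1 and 2 should hold the date and the time without the fractional part, and field 5 the thread number, which is what ts() and a[$5] rely on.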
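Addendum re: comment: for a dynamic number of threads, we need two passes over the file. In the first pass we find out how many threads there are, and in the second pass we print. This is necessary because the entries for the first second in the file might not tell us about all threads. Since this is getting unwieldy for a one-liner, put the following code into a file: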
#!/usr/bin/awk -f
BEGIN {
  FS = "[,.# ]+"
  OFS = ","
}
function ts() {
  return $1 " " $2
}
function dump() {
  printf("%s", saved);
  for(i = 0; i <= threads; ++i) {
    printf("%s%d", OFS, a[i])
  }
  print ""
}
# NR == FNR is true only for the first pass.
NR == FNR {
  threads = $5 > threads ? $5 : threads
  next
}
FNR == 1 {
  printf("TimeStamp");
  for(i = 0; i <= threads; ++i) {
    printf("%sThread%d", OFS, i + 1)
  }
  print "";
}
ts() != saved {
  if(FNR != 1) {
    dump()
  }
  delete a
  saved = ts()
}
{ ++a[$5] }
END { dump() }
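Note that the filename must be given to awk twice. It works almost the same way, except that there is a pass before printing that finds the highest thread number, and the printing is done in a loop.

There are more than 4 "threads" in the input. How do we know which threads go into the output?
So it is possible with an awk command. One more request for help: identifying the threads dynamically. In the example above we have 4 threads, but we could also have more. So instead of changing the command, can the number be identified from the file itself?
Yes, but it requires two passes over the file (because the first entries might not tell us about all threads). See the edit.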
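The two-pass idiom itself (NR == FNR is true only while awk reads the first file argument) can be tried in isolation on a few lines. The following is a rough sketch, assuming a POSIX shell with mktemp available; the variable names log, result and max are hypothetical, and the three sample lines are taken from the input above.

```shell
# Write a small sample log to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
04/09/15,08:49:05.001210 [Dispatch#3 (0x1b3b738)] NOTI
04/09/15,08:49:05.118592 [Dispatch#0 (0x1b3b708)] NOTI
04/09/15,08:49:06.041505 [Dispatch#0 (0x1b3b708)] NOTI
EOF

# The same file is passed twice so that awk reads it in two passes.
# Pass 1 (NR == FNR): remember the highest thread number seen.
# Pass 2: print each line's timestamp and thread next to that maximum.
result=$(awk -F '[,.# ]+' '
  NR == FNR { if ($5 + 0 > max) max = $5 + 0; next }
  { print $1 " " $2, "thread=" $5, "max=" max }
' "$log" "$log")
printf '%s\n' "$result"
rm -f "$log"
```

Every line printed in the second pass already sees the maximum from the whole file, which is why the header and the counter columns can be sized before any result row is written. If the script above were saved under a hypothetical name such as threads.awk, the corresponding invocation would be awk -f threads.awk filename filename.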