awk: command to parse a file, compare data with the next line, and print it in CSV format

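I have the following input, which is space-separated. The first column is the timestamp and the next column is the thread ID.

I want to convert it into a CSV file.

Sample input: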

04/09/15,08:49:05.001210  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:05.118592  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:05.225846  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:05.361914  [Dispatch#1 (0x1b3b718)] NOTI  
04/09/15,08:49:05.469372  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:05.569784  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:05.738324  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:05.851328  [Dispatch#1 (0x1b3b718)] NOTI  
04/09/15,08:49:05.965042  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:06.041505  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:06.151353  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:07.814024  [Dispatch#1 (0xb29718)] NOTI   
04/09/15,08:49:07.588469  [Dispatch#1 (0xb29718)] NOTI   
04/09/15,08:49:07.371815  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:49:07.160045  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:49:07.979571  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:50:08.385921  [Dispatch#0 (0x120e708)] NOTI  
04/09/15,08:50:08.450522  [Dispatch#3 (0x120e738)] NOTI  
04/09/15,08:50:08.550118  [Dispatch#1 (0x120e718)] NOTI  
04/09/15,08:50:08.600923  [Dispatch#0 (0x120e708)] NOTI  

Output in CSV format:

TimeStamp,Thread1,Thread2,Thread3,Thread4
04/09/15 08:49:05,2,2,2,3
04/09/15 08:49:06,1,0,1,0
04/09/15 08:49:07,3,2,0,0
04/09/15 08:49:08,2,1,0,1

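So I want to print the number of records each thread processed in a given second.

So in the example above, at 04/09/15 08:49:07, thread 1 (0x1b3b718) has 3 records, thread 2 (0xb29718) has 2 records, and threads 3 and 4 have no records.

Please advise whether this information can be obtained with an awk command.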

If I understand correctly what you are trying to do, then

awk -F '[,.# ]+' -v OFS=, 'function ts() { return $1 " " $2 } function dump() { print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0 } BEGIN { print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4" } ts() != saved { if(NR != 1) dump(); delete a; saved = ts() } { ++a[$5] } END { dump() }' filename

is a somewhat crude way to do it.

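The trick is to use the field separator regex [,.# ]+ (note the space inside the brackets), which splits each line so that the timestamp ends up in fields 1 and 2 and the thread number in field 5; the fractional seconds land in field 3 and [Dispatch in field 4. The -v OFS=, option sets the output field separator to a comma so that the output is CSV. Broken out with comments, the program is: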

function ts() {       # function to build a full timestamp as it is printed
  return $1 " " $2    # later
}

function dump() {     # function to print a result line. The +0 is to force
                      # the fields to be numbers, in case one remained empty.
  print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0
}

BEGIN {               # in the beginning, print the header line.
  print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4"
} 

ts() != saved {       # if the timestamp changed:
  if(NR != 1) dump()  # if we're not just starting, print the result for
                      # the last block
  delete a            # discard counters
  saved = ts()        # save new timestamp
}
{ ++a[$5] }           # increase the counter for the thread this line mentions
END { dump() }        # and in the end, print the result for the last block.
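
To check how the field separator splits a line, the following minimal sketch prints each field of one sample line together with its number. The function name show_fields is made up for this illustration; the regex and the sample line are taken from above.

```shell
# show_fields is a hypothetical helper name used only for this illustration.
# It prints every field of each input line together with its field number,
# using the same field separator regex as the command above.
show_fields() {
  awk -F '[,.# ]+' '{ for (i = 1; i <= NF; ++i) printf "$%d = %s\n", i, $i }'
}

# Feed it the first line of the sample input.
printf '%s\n' '04/09/15,08:49:05.001210  [Dispatch#3 (0x1b3b738)] NOTI' | show_fields
```

Field 5 should come out as the bare thread number, while fields 1 and 2 together form the timestamp down to the second. Trailing spaces on an input line produce one extra empty field at the end, which the program never looks at.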
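Addendum re: comment: for a dynamic number of threads, we need two passes over the file. In the first pass we find out how many threads there are, and in the second we print. This is because the entries for the first second in the file may not mention every thread. Since this is getting unwieldy for a one-liner, put the following code into a file: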

#!/usr/bin/awk -f

BEGIN {
  FS  = "[,.# ]+"
  OFS = ","
}

function ts() {
  return $1 " " $2
}

function dump() {
  printf("%s", saved);
  for(i = 0; i <= threads; ++i) {
    printf("%s%d", OFS, a[i])
  }
  print ""
}

# NR == FNR is true only for the first pass. Record the highest
# (zero-based) thread number seen; + 0 forces a numeric comparison.
NR == FNR {
  threads = ($5 + 0 > threads) ? $5 + 0 : threads
  next
}

FNR == 1 {
  printf("TimeStamp");
  for(i = 0; i <= threads; ++i) {
    printf("%sThread%d", OFS, i + 1)
  }
  print "";
} 

ts() != saved {
  if(FNR != 1) {
    dump()
  }

  delete a
  saved = ts()
}
{ ++a[$5] }
END { dump() }

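Note that the file name must be given to awk twice. It works almost the same way, except that there is a pass before printing that finds the highest thread number, and the printing is done in a loop.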
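
When the log arrives on a pipe and cannot be read twice, a different technique is to hold the counts in memory and print everything at the end. The following is a minimal single-pass sketch, not the two-pass method described above; the function name count_threads and the array names are made up for illustration. It assumes that the counts fit in memory and that all lines with the same second belong in one row, so lines sharing a timestamp are merged even when they are not adjacent.

```shell
# count_threads is a hypothetical helper: it reads log lines on stdin and
# prints one CSV row per second, with one column per thread number.
count_threads() {
  awk -F '[,.# ]+' -v OFS=, '
    {
      t = $1 " " $2                          # timestamp down to the second
      if (!(t in seen)) {                    # remember first-seen order
        seen[t] = 1
        order[++n] = t
      }
      if ($5 + 0 > max) max = $5 + 0         # highest thread number so far
      ++count[t, $5 + 0]                     # per-second, per-thread counter
    }
    END {
      printf "TimeStamp"
      for (i = 0; i <= max; ++i) printf "%sThread%d", OFS, i + 1
      print ""
      for (k = 1; k <= n; ++k) {
        printf "%s", order[k]
        for (i = 0; i <= max; ++i) printf "%s%d", OFS, count[order[k], i]
        print ""
      }
    }
  '
}

# Usage: pipe any part of the log into the function.
count_threads <<'EOF'
04/09/15,08:49:05.001210  [Dispatch#3 (0x1b3b738)] NOTI
04/09/15,08:49:05.118592  [Dispatch#0 (0x1b3b708)] NOTI
04/09/15,08:49:06.041505  [Dispatch#0 (0x1b3b708)] NOTI
EOF
```

The trade-off is memory for convenience: nothing is printed until the whole input has been read, whereas the two-pass script prints each row as soon as the timestamp changes.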

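There are more than 4 "threads" in the input; how do we know which threads go into the output?

It works with the awk command. One more request: can the threads be identified dynamically? In the example above we have 4 threads, but there could be more. So instead of changing the command, can this be determined from the file itself?

Yes, but it requires two passes over the file (because the first entries may not mention every thread). See the edit.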