Bash AWK: How to delete rows whose column count differs from the header's, when the data contains commas inside double quotes
I need to delete data rows whose column count differs from the header's column count. The following works except when a field's data contains commas inside double quotes. Is there a good way to handle this?
cleanColumns=$(awk -F, 'NR==1{ count=NF } NF==count' testData.txt)
echo "$cleanColumns" > noErrors.txt
Before:
timeStamp,elapsed,label,responseCode,responseMessage,dataType,success,bytes,grpThreads,allThreads,Latency,Hostname,IdleTime,Connect
1459774220811,2018,Fizz_Homepage_2,403," transaction : 1,failing samples : 0",,false,12928,2,2,0,HOST1,5008,0
1459774225103,3485,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,1138878,2,2,0,HOST1,5022,0
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,18792,2,2,0,HOST1,5012,0
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,18792,2,2,0,HOST1,
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,
After:
1459774220811,2018,Fizz_Homepage_2,403," transaction : 1,failing samples : 0",,false,12928,2,2,0,HOST1,5008,0
1459774225103,3485,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,1138878,2,2,0,HOST1,5022,0
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,18792,2,2,0,HOST1,5012,0
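The reason the plain `-F,` version misbehaves is that commas inside double quotes still count as separators, so every data row with a quoted, comma-containing field reports one more field than the header. A minimal demonstration:

```shell
# The quoted field "a,b" should count as one field, but -F, splits on
# every comma, so awk reports 4 fields instead of 3.
echo '1,"a,b",2' | awk -F, '{print NF}'   # prints 4
```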
If you have gawk, you can set the FPAT variable to define the fields themselves (rather than the field separators). For example:
gawk -v FPAT="([^,]+)|(\"[^\"]+\")" 'NR==1{count=NF} NF==count' file
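A self-contained sketch of applying this to data shaped like the question's (the file names testData.txt and noErrors.txt are taken from the question; assumes gawk is installed):

```shell
# Build a small sample file: a header plus one good row and one row
# with an extra field.
cat > testData.txt <<'EOF'
timeStamp,elapsed,label
1459774220811,2018,"a,b"
1459774225103,3485,"a,b",extra
EOF

# FPAT describes what a field looks like: either a run of non-comma
# characters or a double-quoted string (which may contain commas).
# Keep only rows whose field count matches the header's.
gawk -v FPAT='([^,]+)|("[^"]+")' 'NR==1{count=NF} NF==count' testData.txt > noErrors.txt

# noErrors.txt now holds the header and the 3-field row only.
cat noErrors.txt
```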
With GNU awk, using FPAT:
$ awk -v FPAT='[^,]*|"[^"]+"' 'NR==1{nf=NF} NF==nf' file
timeStamp,elapsed,label,responseCode,responseMessage,dataType,success,bytes,grpThreads,allThreads,Latency,Hostname,IdleTime,Connect
1459774220811,2018,Fizz_Homepage_2,403," transaction : 1,failing samples : 0",,false,12928,2,2,0,HOST1,5008,0
1459774225103,3485,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,1138878,2,2,0,HOST1,5022,0
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,18792,2,2,0,HOST1,5012,0
With other awks you would need a while(match()) loop using the same regexp, e.g.:
$ cat tst.awk
BEGIN { FS=RS; OFS="," }
{
    head = ""
    tail = $0
    while ( (tail != "") && match(tail, /[^,]*|"[^"]+"/) ) {
        head = head (head=="" ? "" : FS) substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART+RLENGTH+1)
    }
    $0 = head tail
}
NR==1 { nf=NF }
NF==nf { $1=$1; print }
$
$ awk -f tst.awk file
timeStamp,elapsed,label,responseCode,responseMessage,dataType,success,bytes,grpThreads,allThreads,Latency,Hostname,IdleTime,Connect
1459774220811,2018,Fizz_Homepage_2,403," transaction : 1,failing samples : 0",,false,12928,2,2,0,HOST1,5008,0
1459774225103,3485,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,1138878,2,2,0,HOST1,5022,0
1459774227844,1653,Fizz_Launch_1,200," transaction : 1,failing samples : 0",,true,18792,2,2,0,HOST1,5012,0
The above does more than you need, since it builds a record in which the fields are first separated by newlines and then changes the newlines back to commas before printing. That is not just for getting a field count: you can write loops over the fields or otherwise access them, just as if you had used FPAT. Here is the general approach for identifying the fields of a CSV file without GNU awk:
$ cat tst.awk
{
    csv2flds()
    for (i=0; i<=NF; i++) {
        print "NR="NR, "NF="NF, "$"i"="$i
    }
    print "-----"
}
function csv2flds(   head, tail, ofs) {
    ofs=OFS; OFS=","; FS=RS
    head = ""
    tail = $0
    while ( (tail != "") && match(tail, /[^,]*|"[^"]+"/) ) {
        head = head (head=="" ? "" : FS) substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART+RLENGTH+1)
    }
    $0 = head tail   # calculates NF and splits into fields using FS="\n"
    $1 = $1          # converts "xFSy" into "xOFSy" so "x\ny" becomes "x,y"
    FS=OFS; OFS=ofs
}
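To see the debug script in action with any POSIX awk, you could run it on a small sample (a sketch; the file names tst.awk and sample.csv are illustrative):

```shell
# Save the csv2flds debug script shown above as tst.awk.
cat > tst.awk <<'EOF'
{
    csv2flds()
    for (i=0; i<=NF; i++) {
        print "NR="NR, "NF="NF, "$"i"="$i
    }
    print "-----"
}
function csv2flds(   head, tail, ofs) {
    ofs=OFS; OFS=","; FS=RS
    head = ""
    tail = $0
    while ( (tail != "") && match(tail, /[^,]*|"[^"]+"/) ) {
        head = head (head=="" ? "" : FS) substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART+RLENGTH+1)
    }
    $0 = head tail   # resplits into fields using FS="\n"
    $1 = $1          # rebuilds $0 with OFS="," between fields
    FS=OFS; OFS=ofs
}
EOF

# One record with a quoted, comma-containing middle field:
printf '%s\n' '1,"a,b",2' > sample.csv

# The quoted field is reported as a single field, so NF is 3 not 4.
awk -f tst.awk sample.csv
```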
An example using a CSV parser: a perl-style ruby one-liner
ruby -rcsv -ne '
row = CSV.parse_line($_)
n = row.length if $. == 1
puts $_ if row.length == n
' filename
Use a tool designed for parsing CSV. Python and Perl, for example, both have CSV modules.
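For instance, a Python counterpart to the ruby one-liner above, using the standard library csv module (a sketch; assumes python3 is on PATH, and the file names data.csv and clean.csv are illustrative):

```shell
# Sample input: a header, a matching row with a quoted comma, and a short row.
cat > data.csv <<'EOF'
a,b,c
1,"x,y",3
1,2
EOF

# Parse each line with the csv module; print the original line only
# if its parsed field count matches the header's.
python3 -c '
import csv, sys
n = None
for line in open(sys.argv[1], newline=""):
    row = next(csv.reader([line]))
    if n is None:
        n = len(row)          # field count of the header
    if len(row) == n:
        sys.stdout.write(line)
' data.csv > clean.csv

# clean.csv keeps the header and the quoted-comma row; the short row is gone.
cat clean.csv
```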