Unix awk: compare two files and print formatted output


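I want to compare two files based on the first field ($1) of each file.

Lines that match (the key is present in both Aug.csv and Sep.csv) should be printed with the fields from both files, with the last field, Remarks, set to "Matched".

Lines found only in Aug.csv (present in Aug.csv but not in Sep.csv) should be followed by one "NOT" placeholder for each of the 5 fields (NF) of Sep.csv, i.e. "NOT,NOT,NOT,NOT,NOT", with the last field, Remarks, set to "Not in Sep.csv" (or the file name).

Lines found only in Sep.csv (present in Sep.csv but not in Aug.csv) should be preceded by one "NOT" placeholder for each of the 4 fields (NF) of Aug.csv, i.e. "NOT,NOT,NOT,NOT", with the last field, Remarks, set to "Not in Aug.csv" (or the file name).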

Aug.csv

Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug
Sep.csv

Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep
Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
or, with empty fields in place of "NOT":

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv
I tried the two commands below to get the desired output, without success.

First command:

 awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next}{if (a[$1])print a[$1],$0,"Matched";else print first, $0,"Not in Aug.csv";}' OFS="," Aug.csv Sep.csv >Output.csv
Second command:

awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next} !($1 in a) {print $0,second,"Not in Sep.csv";}' OFS="," Sep.csv Aug.csv  >>Output.csv  
The commands above produce the following Output.csv:

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
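Here I am missing the following two matched lines from the expected output (the first "aaa" row of Aug.csv). Please advise how to handle this; the command seems to ignore duplicate entries.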

aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
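I would also like to know how the variables "$first" and "$second" (i.e. awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT") can be made dynamic, based on the number of fields/headers in Aug.csv and Sep.csv. The original files contain more fields, and the count varies each time (10 fields, 15 fields, and so on), so I do not want to type "NOT" 10 times by hand. Alternatively, is there a way to repeat "FS" when printing, based on the number of fields in the original file, so that my output is formatted as below?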

Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
or, with empty fields in place of "NOT":

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv

Please advise; I am looking forward to your suggestions.

A more involved GNU awk solution:

compare.awk script:

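# Usage (GNU awk; Sep.csv must come first): gawk -f compare.awk Sep.csv Aug.csv

# Return "NOT" repeated n times, joined by FS.
function prNot(n,    r) {
    r = "NOT"
    while (--n > 0) r = r FS "NOT"
    return r
}
BEGIN { FS = OFS = "," }
# First file (Sep.csv): remember the header; store fields 2..NF of each row under its key.
NR == FNR {
    if (NR == 1) {
        sep_nf = NF; sep_fn = FILENAME; h = $0
    } else {
        sep[$1][++c] = $2
        for (i = 3; i <= NF; i++) sep[$1][c] = sep[$1][c] FS $i
    }
    next
}
# Second file (Aug.csv): print the combined header.
FNR == 1 {
    aug_nf = NF; aug_fn = FILENAME; print $0, h, "Remarks"; next
}
# Key present in both files: pair this row with every stored Sep.csv row.
$1 in sep { matched[$1]; for (i in sep[$1]) print $0, $1, sep[$1][i], "Matched" }
# Key present only in Aug.csv.
!($1 in sep) { print $0, prNot(sep_nf), "Not in " sep_fn }
# Keys present only in Sep.csv.
END {
    for (i in sep)
        if (!(i in matched))
            for (j in sep[i]) print prNot(aug_nf), i, sep[i][j], "Not in " aug_fn
}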
The output:

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv

With GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    for (i=1; i<=NF; i++) {
        nots[ARGIND] = (i>1 ? nots[ARGIND] OFS : "") "NOT"
    }
}
NR==FNR {
    file1[$1][++cnt[$1]] = $0
    next
}
{
    file2[$1]
    if ($1 in file1) {
        for (num in file1[$1]) {
            print file1[$1][num], $0, (FNR>1 ? "Matched" : "Remarks")
        }
    }
    else {
        print nots[1], $0, "Not in " ARGV[1]
    }
}
END {
    for (name in file1) {
        if ( !(name in file2) ) {
            for (num in file1[name]) {
                print file1[name][num], nots[2], "Not in " ARGV[2]
            }
        }
    }
}
If the output order matters, there are various ways to handle it.
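
One such way, sketched here as a variation rather than as either script above: keep the rows of each file in numerically indexed arrays and loop over the indexes in END, so the output follows file order and nothing depends on the traversal order of "for (x in array)". The sketch assumes plain comma-separated data without quoted fields, runs in any POSIX awk, and takes the files in the order Aug.csv Sep.csv. The names pad, fill, nfile and the temporary directory are illustrative assumptions. Passing -v fill="" would give empty fields instead of "NOT".

```shell
# Recreate the sample files from the question in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/Aug.csv" <<'EOF'
Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug
EOF
cat > "$workdir/Sep.csv" <<'EOF'
Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep
EOF

result=$(cd "$workdir" && awk -v fill="NOT" '
# Return n copies of the filler text joined by OFS.
function pad(n,    i, r) {
    for (i = 1; i <= n; i++) r = (i > 1 ? r OFS : "") fill
    return r
}
BEGIN { FS = OFS = "," }
# Header line of each file: remember field count, file name and header text.
FNR == 1 { nfile++; nf[nfile] = NF; fname[nfile] = FILENAME; hdr[nfile] = $0; next }
# Data rows of the first file (Aug.csv), stored in input order.
nfile == 1 { aug[++na] = $0; augkey[na] = $1 ""; inaug[$1]; next }
# Data rows of the second file (Sep.csv), stored in input order.
{ sep[++ns] = $0; sepkey[ns] = $1 ""; insep[$1] }
END {
    print hdr[1], hdr[2], "Remarks"
    for (i = 1; i <= na; i++) {
        if (augkey[i] in insep) {
            # Pair this Aug row with every Sep row that has the same key.
            for (j = 1; j <= ns; j++)
                if (sepkey[j] == augkey[i]) print aug[i], sep[j], "Matched"
        } else
            print aug[i], pad(nf[2]), "Not in " fname[2]
    }
    # Sep rows whose key never appeared in Aug.
    for (j = 1; j <= ns; j++)
        if (!(sepkey[j] in inaug)) print pad(nf[1]), sep[j], "Not in " fname[1]
}' Aug.csv Sep.csv)

printf '%s\n' "$result"
rm -rf "$workdir"
```

The nested loop in END is quadratic in the number of rows, which is fine for small files; for large inputs, storing the Sep row indexes per key would avoid rescanning.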