awk: completely separating duplicates and non-duplicates


Suppose we have this input:

TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N

Now we want to separate the duplicates from the non-duplicates based on the fourth column.

Duplicates:

95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1

Non-duplicates:

95,CPD-3333333,-1,c1ccccc1N

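The attempt below separates the duplicates without any problem. However, the first occurrence of each duplicated value still ends up in the non-duplicate file: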

BEGIN { FS = ","; f1="a"; f2="b"}

{
# Keep count of the fields in fourth column
count[$4]++;

# Save the line the first time we encounter a unique field
if (count[$4] == 1)
    first[$4] = $0;


# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
    print first[$4] > f1 ;

# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
    print > f1;

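# Meant for non-duplicates, but it also fires on the first occurrence of a
# value that is duplicated later on.
# (Changing the test to: count[$4] - count[$4] == 0 does not work.)
if (count[$4] == 1)
    print first[$4] > f2;
}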

Non-duplicate output of the attempt:

TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1

Could any guru offer comments or a solution? Thanks.

I would do it this way:

awk '
    NR==FNR {count[$2] = $1; next} 
    FNR==1  {FS=","; next} 
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' <(cut -d, -f4 input | sort | uniq -c) input

Yes, the input file is given twice.
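The same two-pass idea can also be sketched in awk alone, by naming the file twice on the command line instead of going through cut, sort, and uniq. This is only a minimal sketch: the file names input.csv, dup.txt, and nondup.txt are assumptions made for illustration, and the data is the sample from the question.

```shell
# Sample data from the question, written to a hypothetical file name.
cat > input.csv <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

awk -F, '
    # Pass 1 (first copy of the file): count each value of column 4,
    # ignoring the header line.
    NR == FNR { if (FNR > 1) count[$4]++; next }
    # Pass 2 (second copy): skip the header, then route each line by its count.
    FNR > 1 { print > (count[$4] > 1 ? "dup.txt" : "nondup.txt") }
' input.csv input.csv
```

Because the counts are complete before any line is written, the first occurrence of a duplicated value goes to dup.txt along with the later ones, and the input order is kept within each output file.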

This solution only works if the duplicates are grouped together:

awk -F, '
  function fout(    f, i) {
    f = (cnt > 1) ? "dups" : "nondups"
    for (i = 1; i <= cnt; ++i)
      print lines[i] > f
  }
  NR > 1 && $4 != lastkey { fout(); cnt = 0 }
  { lastkey = $4; lines[++cnt] = $0 }
  END { fout() }
' file
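If the duplicates are not already adjacent, one way to meet that requirement is to sort on the fourth column first. The sketch below assumes that the original line order may be given up and that the header can be dropped; the file names unsorted.csv, dups.txt, and nondups.txt and the helper name emit_group are hypothetical. The rows are the question's sample, reordered so that the duplicates are no longer adjacent.

```shell
# The question's rows, deliberately reordered (hypothetical file name).
cat > unsorted.csv <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-2222222,-3,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
EOF

# Drop the header, group equal keys by sorting on column 4, then flush one
# group at a time: groups of size 1 are non-duplicates, larger ones duplicates.
tail -n +2 unsorted.csv | LC_ALL=C sort -t, -k4,4 | awk -F, '
    function emit_group(    f, i) {
        if (cnt == 0) return                  # nothing buffered yet
        f = (cnt > 1) ? "dups.txt" : "nondups.txt"
        for (i = 1; i <= cnt; i++)
            print lines[i] > f
    }
    $4 != lastkey { emit_group(); cnt = 0 }   # key changed: flush previous group
    { lastkey = $4; lines[++cnt] = $0 }       # buffer the current line
    END { emit_group() }                      # flush the final group
'
```

Only one group is held in memory at a time, which is the main reason to prefer this shape over counting everything up front when the file is large.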

A bit late, but here is my awk version:

awk -F, 'NR>1{a[$0":"$4];b[$4]++}
        END{d="\n\nnondupe";e="dupe"
        for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file

Another one, built along the same lines as glenn jackman's, but done entirely in awk:

awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file

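Does the order of the lines matter?

No. I can always sort the output afterwards.

Are the duplicated lines always grouped together?

+1. But why bother with a separate output variable instead of just {print > count[$NF]==1 ? nondup : dup}?

I don't think there is anything unclear about it.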
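On the inline redirection raised in the comments: awk implementations may parse an unparenthesized expression after `>` differently, so wrapping the target in parentheses is the safe spelling. The sketch below shows that form in a single pass that buffers the lines and keeps the input order. The file names sample.csv, dup.out, and nondup.out are assumptions, and buffering every line assumes the file fits in memory.

```shell
# Sample data from the question, written to a hypothetical file name.
cat > sample.csv <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

awk -F, '
    # Skip the header; remember every line, its key, and the key count.
    NR > 1 { count[$4]++; line[++n] = $0; key[n] = $4 }
    END {
        # Replay the lines in input order; the parentheses make the whole
        # conditional expression the redirection target.
        for (i = 1; i <= n; i++)
            print line[i] > (count[key[i]] == 1 ? "nondup.out" : "dup.out")
    }
' sample.csv
```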

$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }

    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}

$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1

Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
