Remove duplicates, keeping only the last occurrence of each in a file (Linux)

Tags: linux, shell, awk

Input file:

5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,,user,,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,C
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,,user,,f660818af5625b3be61fe12489689601,50328589469,,,30002,C
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,,user,,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,C
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,,user,,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,C
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,Nawras,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C
Desired output:

5,,OR1,1000,UY,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H 
5,,OR2,2000,UY,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H    
5,,OR1,1000,UY,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H    
0,,OR5,5000,UY,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,UY,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C*
Code used:

for i in $(awk -F, '{print $13}' file | sort -u)
do
    # note: grep matches $i anywhere on the line, not only in column 13
    grep "$i" file | tail -1 >> TESTINGGGGGGG_SV
done
This takes a very long time, since the file has 300 million records with about 65 million unique values in column 13, and each loop iteration greps the entire file.


So I need an approach that walks the values of column 13 and outputs, for each one, its last occurrence in the file.

awk to the rescue!

awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}' file
This needs the input sorted (grouped on column 13).
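
If the input is not already grouped on column 13, it can be sorted first and piped in. A minimal sketch, assuming a sort that supports the stable -s flag (GNU and BSD sort both do); stability matters so that the last line of each group is still the last occurrence from the original file:

# stable sort on column 13 only, then print the last line of each group
sort -s -t, -k13,13 file | awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}'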

Please post the timings if you manage to run the script successfully.

If sorting is not an option, an alternative is

tac file | awk -F, '!a[$13]++' | tac

Reverse the file, take the first entry for each $13, and reverse the result back.
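
As a quick sanity check of the !a[$13]++ idiom (keep the first line seen per key), here is a toy run keyed on column 1 instead of 13; the three input lines are made up for illustration, and tac is GNU coreutils (on BSD/macOS, tail -r plays the same role):

printf 'a,1\na,2\nb,3\n' | tac | awk -F, '!a[$1]++' | tac
# output: a,2 then b,3 -- the last occurrence of each key, in file order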

Here is a solution that would work:

awk -F, '{rows[$13]=$0} END {for (i in rows) print rows[i]}' file
Explanation:

  • rows is an associative array indexed by field 13 ($13); the element keyed by $13 is overwritten each time field 13 repeats, and its value is the whole line ($0), so after the last record each entry holds the last occurrence for its key
  • Note that for (i in rows) visits keys in no particular order, so the original file order is not preserved in the output
But this is memory-inefficient, since holding the whole array takes space.

An improvement on the above solution, which still avoids sorting, is to save only the line numbers in the associative array:

awk -F, '{rows[$13]=NR} END {for(i in rows) print rows[i]}' file | while read lN; do sed "${lN}q;d" file; done
Explanation:

  • Same as before, but the value stored is the line number (NR) rather than the whole line
  • awk -F, '{rows[$13]=NR} END {for(i in rows) print rows[i]}' file outputs the list of line numbers that hold the desired lines
  • sed "${lN}q;d" file fetches line number lN from the file (a faster two-pass alternative follows below)
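
One caveat: the while/sed loop rescans the file from the top once per wanted line, which is painful with 65 million keys. A sketch of a two-pass awk variant under the same no-sort constraint (the "wanted" temp file name is illustrative):

awk -F, '{rows[$13]=NR} END {for(i in rows) print rows[i]}' file > wanted
# second pass: print exactly those lines whose numbers are in the wanted set
awk 'NR==FNR {want[$1]; next} FNR in want' wanted file

This also restores the original file order, since the second pass walks the file top to bottom.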

perl -F, -lane '$seen{$F[12]} = $_; END { print $seen{$_} for sort keys %seen }'
Have you thought about how much memory the program will use? 65 million unique records: if each record is 50 bytes, that comes to roughly 3 GB of raw data, not counting whatever awk needs to keep the array structured. Do the math yourself:
perl -le 'print 65_000_000*50/1024/1024/1024'
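
That prints about 3.03, i.e. roughly 3 GB of raw key/value data before counting awk's own per-element hash overhead.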