Sed/Awk: how to find and delete two lines if a pattern in the first line is repeated (bash)


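I am working with text files that contain thousands of records each. Each record consists of two lines: a header that starts with ">" followed by a line holding a long string of the characters "-AGTCNR". The header has 10 fields separated by "|". Its first field is a unique identifier for each record, e.g. ">KEN096-15", and records that share the same identifier are called duplicates. Here is what a simple set of records looks like: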

>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----  
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
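Now I am trying to delete the repeated records, such as the duplicates of "ACRJP458-10" and "PMANL2431-12". Using a bash script, I have extracted the unique identifiers and stored the repeated ones in the variable "$duplicate_headers". Currently, I am trying to find every repeated instance of their two-line records and delete it as follows: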

for i in "$@"
do
    unset duplicate_headers
    duplicate_headers=`grep ">" $1 | awk 'BEGIN { FS="|"}; {print $1 "\n"}' | sort | uniq -d`
    for header in `echo -e "${duplicate_headers}"`
    do
        sed -i "/^.*\b${header}\b.*$/,+1 2d" $i
        #sed -i "s/^.*\b${header}\b.*$/,+1 2g" $i
        #sed -i "/^.*\b${header}\b.*$/{$!N;s/*//2g;}" $i
    done
done
The final result (bearing in mind that there are thousands of records) should look like this:

>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----  
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
To run it on multiple files at once, removing duplicates across all of the files:

awk -F'[|]' 'FNR%2{f=seen[$1]++} !f' *
Or to remove duplicates only within each file:

awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[$1]++} !f' *
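
As a rough illustration of how the two commands differ, the sketch below runs them on two tiny made-up files. The file names `a.fas` and `b.fas`, the IDs, and the wrapper functions `dedup_all` and `dedup_each` are assumptions for the example only. The variable is renamed from `f` to `skip` for readability. The logic is otherwise the same as the one-liners above.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' '>ID1|Lep|x' 'AAAA' '>ID2|Dip|y' 'CCCC' '>ID1|Lep|x' 'GGGG' > "$tmp/a.fas"
printf '%s\n' '>ID2|Dip|y' 'TTTT' '>ID3|Col|z' 'ACGT' > "$tmp/b.fas"

# Remove duplicates across all input files: `seen` is never reset.
dedup_all() {
    awk -F'[|]' '
        FNR % 2 { skip = seen[$1]++ }   # odd line = header; skip is nonzero if the ID was seen before
        !skip                           # print the header and its sequence line unless flagged
    ' "$@"
}

# Remove duplicates within each file only: `seen` is cleared at the start of every file.
dedup_each() {
    awk -F'[|]' '
        FNR == 1 { delete seen }        # first line of a new input file
        FNR % 2 { skip = seen[$1]++ }
        !skip
    ' "$@"
}

dedup_all  "$tmp/a.fas" "$tmp/b.fas"
echo '---'
dedup_each "$tmp/a.fas" "$tmp/b.fas"
rm -r "$tmp"
```

With these inputs, the first call drops the second `ID1` record and also the `ID2` record from `b.fas`, because `ID2` was already seen in `a.fas`. The second call keeps the `ID2` record in `b.fas`. Both approaches rely on every record being exactly two lines, and both write to standard output without modifying the input files.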

You can probably just use -F'|'.
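
A quick way to check this is to print the first field with both separator spellings. In awk, a single-character field separator other than a space is taken literally, so the two forms should behave the same here. The function names below are hypothetical helpers for the comparison.

```shell
#!/bin/sh
# Hypothetical helpers that print the first "|"-separated field of each input line.
first_field_literal() { awk -F'|'   '{ print $1 }'; }
first_field_bracket() { awk -F'[|]' '{ print $1 }'; }

# Both calls are expected to print the record identifier, including the leading ">".
printf '%s\n' '>KEN096-15|Lep|gs-NA' | first_field_literal
printf '%s\n' '>KEN096-15|Lep|gs-NA' | first_field_bracket
```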