How do I find the character sequence that matches an identifier in Linux/Unix?
I have a FASTA file named mytext.fasta:
>lcl|NW_001820834.1_gene_4 [locus_tag=SS1G_01081] [db_xref=GeneID:5493597] [partial=5',3'] [location=complement(<6452..>8801)] [gbkey=Gene]
ATGCAATTGGCAGCAGTCCTAAGCCTCGTGGGCTTGGTTACGGCTCAATGTCCGTACGGATTTGACACAC
CACTTCAAAAGCGTGAATCTATTGATGCTCAAGCCAGTAGTTCTAGTTTCTTGAATCAATTCACAATTAA
CGATACCGATGCACACTTTACCACCGACGCAGGTGGGCCTATGCAAGAGGACACTAGTTTGAAAGCTGGG
>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
>lcl|NW_001820834.1_gene_6 [locus_tag=SS1G_01083] [db_xref=GeneID:5494096] [partial=5',3'] [location=<12203..>15199] [gbkey=Gene]
ATGAGAGGCAAGCTTGGTGTCACAGTTGCTGCATTTGCGACGGCATTTCTAAATACGACACTTGCTCAAG
ACTCAACATCATCACAAGCGGATGCGGATACTACCACAAGTTATTGTCCCGTTTACACGCTCACAGCTTC
AGTTGATGCCAGCGCACCTATTATCCCAAACATCCACGATCCGCAGGCAATTAATCCACAAGATGTTTGT
CCGGGGTATACTGCATCCAATGTGAAGCGAACCTCTCACGGATTGACGGCTTCTCTGTCATTGGCTGGTG
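Instead, I want to get the complete record for one identifier (here SS1G_01082), that is, the header plus its sequence lines: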
>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
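As you can see, every sequence in this file starts with >, so I want to get the full length of the sequence when I grep for an identifier. How can I do this?
This is easier with GNU awk and a custom record separator (RS):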
awk -v RS='(^|\n)>' '/SS1G_01082/{print RT $0}' file
>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
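For systems where GNU awk is not available, the same idea (treat everything from one > header up to the next as a single record) can be sketched with plain POSIX awk. This is only an illustrative alternative, not the answer's method: the function name fasta_record and the shortened sample records are made up for the example, and the pattern is matched against header lines only.

```shell
# Hypothetical helper: print every FASTA record whose header line matches an
# extended regular expression. Reads the named files, or standard input when
# no file is given.
fasta_record() {
    pattern=$1
    shift
    awk -v pat="$pattern" '
        /^>/ { keep = ($0 ~ pat) }   # on each header line, decide whether to keep this record
        keep                         # print header and sequence lines while keep is true
    ' "$@"
}

# Demo on three made-up, shortened records; only the second one matches.
printf '>gene_4 [locus_tag=SS1G_01081]\nATGC\n>gene_5 [locus_tag=SS1G_01082]\nATGT\nGGCC\n>gene_6 [locus_tag=SS1G_01083]\nTTAA\n' |
    fasta_record 'SS1G_01082'
```

On the real file this would be called as fasta_record 'SS1G_01082' mytext.fasta. Because it does not use RT (a gawk extension), it should also run under mawk or BusyBox awk.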
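@anbhava Thanks, but I don't know why this doesn't work when I use it in a pipeline. My command is: esearch -db nuccore -q 'SS1G_01082[gene]' | efilter -source refseq -molecular genomic | efetch -format gene_fasta | awk -v RS= '/SS1G_01082/'. The last part (awk -v RS= '/SS1G_01082/') should filter out the required sequence, but it gives me everything.
If esearch -db nuccore -q 'SS1G_01082[gene]' | efilter -source refseq -molecular genomic | efetch -format gene_fasta gives exactly the same output as shown above, that is, with a blank line after each record, then this awk should work.
No, the output has no blank lines. I have updated my question.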
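The exchange above turns on whether the piped output has blank lines between records. Assuming the awk call in the comment used paragraph mode (RS set to the empty string, so records are separated by blank lines), a quick way to check is to count the records awk sees. The helper name count_paragraph_records and the shortened sample records are hypothetical.

```shell
# Hypothetical helper: count the records awk sees in paragraph mode (RS set to
# the empty string), where records are separated by one or more blank lines.
count_paragraph_records() {
    awk -v RS= 'END { print NR }'
}

# Two made-up FASTA records with no blank line between them: awk sees a single
# record, so any pattern that matches prints the whole input.
printf '>gene_4 [locus_tag=SS1G_01081]\nATGC\n>gene_5 [locus_tag=SS1G_01082]\nATGT\n' |
    count_paragraph_records

# The same two records separated by a blank line: awk sees two records.
printf '>gene_4 [locus_tag=SS1G_01081]\nATGC\n\n>gene_5 [locus_tag=SS1G_01082]\nATGT\n' |
    count_paragraph_records
```

Placing count_paragraph_records at the end of the efetch pipeline would show which case applies. If it reports 1, splitting on blank lines cannot isolate a record, and a separator anchored on > is needed. awk reads standard input when no file operand is given, so the answer's command can be used in a pipeline by dropping the trailing file argument.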