Shell awk/sed: matching patterns between files and printing everything between matches


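I am trying to combine the topic and the question: match every string/line in File2 with its occurrence in File1 (each string occurs only once), print the whole line in which it occurs in File2, and also print the lines between each match (i.e. in the order of File2).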

File1

File2

What I have so far:

awk 'FNR==NR{a[$0];next} $1 in a' file2 file1 > output
which gives:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
I would like this:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU
The original files contain thousands of lines, so the fastest solution would be appreciated, whether it uses awk, sed, or anything else.

@jO: Try this:

awk 'FNR==NR{A[$1];next} ($0 ~ /^>/){Q=""} ($1 in A){Q=1} Q{print}' file2  file1
EDIT: Adding an explanation of the solution here as well.

awk '
# FNR restarts at 1 for each input file while NR keeps counting across files,
# so FNR==NR is true only while the first file (file2) is being read.
FNR==NR {
    A[$1]        # store the first field of each file2 line as an index of array A
    next         # skip the remaining rules for this line
}
# The rules below therefore run only for lines of file1.
/^>/      { Q = "" }   # a line starting with ">" begins a new record: clear the flag Q
($1 in A) { Q = 1 }    # the first field is an index of A: set the flag
Q                      # print the line (default action) while Q is non-empty
' file2 file1          # read file2 first, then file1
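To make the flag logic above concrete, here is a minimal sketch that runs the same awk program on a tiny made-up data set. The IDs `>seqA`, `>seqB`, `>seqC`, the record contents, and the helper name `extract_records` are assumptions for illustration only; they are not taken from the question's files.

```shell
#!/bin/sh
# Hypothetical sample data, written to a temporary directory.
workdir=$(mktemp -d)

# file2: the header IDs to extract, one per line.
cat > "$workdir/file2" <<'EOF'
>seqA
>seqC
EOF

# file1: FASTA-like records (a header line followed by sequence lines).
cat > "$workdir/file1" <<'EOF'
>seqA first record
ACGU
GGCC
>seqB second record
UUUU
>seqC third record
CCGG
EOF

# Same logic as the answer: clear the flag on every header line,
# set it again if the header ID is listed in file2, print while it is set.
# extract_records is a hypothetical helper name.
extract_records() {
    awk 'FNR==NR{A[$1];next} /^>/{Q=""} ($1 in A){Q=1} Q' "$1" "$2"
}

result=$(extract_records "$workdir/file2" "$workdir/file1")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With this sample input the `>seqB` record is dropped, while the `>seqA` and `>seqC` records are printed with their sequence lines. Because the lookup is a hash-array membership test on `$1`, each line of file1 is handled in a single pass regardless of how many IDs file2 contains.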

You can try using awk:

awk 'FNR==NR{d[$1]; next}/^>/{f=0}$1 in d{f=1}f' file2 file1
You get:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU

This might work for you (GNU sed):

sed 's:.*:/^&/bb:' file2 | sed -e ':a' -f - -e 'd;:b;n;/^>/ba;bb' file1
Turn file2 into a sed script of the matches to print from file1; lines that do not match are deleted.

This uses two sed invocations. The first builds the regexps to match from file2. The second wraps them in a framework that prints a matching line and everything after it, up to the start of the next record or the end of the file.

IMHO, scientists get faster and better results with tools designed for their work. Such purpose-built tools would achieve this more efficiently ... although, admittedly, I don't run this kind of query every day.
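To see what the two-stage sed pipeline does, the sketch below prints the script that the first sed generates and then runs the full pipeline on made-up data. It assumes GNU sed (for `-f -` and semicolon-separated labels); the IDs and record contents are hypothetical. One caveat worth keeping in mind: each ID is used as an unanchored-at-the-end regexp, so a `.` in a real ID matches any character and an ID also matches any longer header that starts with it.

```shell
#!/bin/sh
# Hypothetical sample data, written to a temporary directory.
workdir=$(mktemp -d)

cat > "$workdir/file2" <<'EOF'
>seqA
>seqC
EOF

cat > "$workdir/file1" <<'EOF'
>seqA first record
ACGU
GGCC
>seqB second record
UUUU
>seqC third record
CCGG
EOF

# Stage 1: each ID becomes "/^ID/bb", meaning
# "if the line starts with ID, branch to label b".
generated=$(sed 's:.*:/^&/bb:' "$workdir/file2")
printf '%s\n' "$generated"

# Stage 2: the generated tests sit between ":a" and "d".
# A line matching no ID is deleted. At label b, "n" prints the current
# line and reads the next one; a new header goes back to ":a" to be
# tested, any other line loops at b and is printed in turn.
result=$(sed 's:.*:/^&/bb:' "$workdir/file2" |
    sed -e ':a' -f - -e 'd;:b;n;/^>/ba;bb' "$workdir/file1")
printf '%s\n' "$result"

rm -rf "$workdir"
```

The generated script has one test per line of file2, so every header in file1 is compared against every ID in turn. For a large ID list the awk array lookup shown earlier is likely the faster choice, though that is an expectation rather than a measured result.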