Sorting a large text file by the length of a column
I have a FASTA (text) file of about 2 GB that needs to be sorted by the length of the 4th column. It looks like this:
MERCURE:174:C0UT3ACXX:5:2316:18091:100842/1 + dogpremirnas 4910 AAAAAAAAAA DDC@BBDDDD 0 3:T>A,9:T>A
MERCURE:174:C0UT3ACXX:5:2316:18110:100902/1 + dogpremirnas 4909 AAAAAAAAAA DDDDDBDDBD 0 0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18153:100840/1 - dogpremirnas 2269 TTTTTTTTTTT BDDB>9<@A>< 0 5:C>T,9:C>T
MERCURE:174:C0UT3ACXX:5:2316:18259:100924/1 + dogpremirnas 833 ACCGATCTCGTA CHHFCC8ACBBB 0 6:G>C,7:C>T,8:T>C
MERCURE:174:C0UT3ACXX:5:2316:18344:100886/1 + dogpremirnas 11734 AAAAAAAAAA DCDCDDDDDD 0 4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18415:100878/1 + dogpremirnas 4909 AAAAAAAAAA BDDCDDDDDB 0 0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18442:100808/1 + dogpremirnas 11734 AAAAAAAAAA DDDDDDDDDB 0 4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18461:100754/1 + dogpremirnas 4914 AAAAAAAAAA DDDDDDDBDB 0 5:T>A,6:T>A
MERCURE:174:C0UT3ACXX:5:2316:18464:100926/1 + dogpremirnas 833 ACCGATCTCGTA HHHFCC/=CBBB 0 6:G>C,7:C>T,8:T>C
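It needs to be sorted by the length of that column. The man page for the sort command says I can specify a key, but it does not say how to use "length" as one.
I only need the lines that have more than 20 symbols in the 4th column. Unfortunately, the software that produced this output (bowtie) does not offer such an option either.
Any suggestions are welcome.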
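Thanks.

I like awk for column data like this (in the sample above the sequence is the fifth whitespace-separated field, hence $5):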
awk 'length($5)>20' /path/to/input > outputfile
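You can do this with common Linux tools, but since the file may exceed available memory you may need a different approach: 1) add an extra column containing the length of the fourth field and sort on that new column, or 2) write your own sorting program.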
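
Option 1 can be sketched roughly as follows. This is only a minimal sketch, not a tested solution for the real 2 GB file: the function name sort_by_field_length is made up for illustration, and the default field number 5 is an assumption based on the sample lines (pass 4 if the sequence really is the 4th field in your file). It prepends the length of the chosen field, sorts numerically on that extra column, then strips it again.

```shell
# Hypothetical helper (name is illustrative, not from the original post):
# sort lines on stdin by the length of one whitespace-separated field.
# Usage: sort_by_field_length FIELD < input > output
sort_by_field_length() {
    field="${1:-5}"   # assumption: the sequence is field 5, as in the sample
    tab="$(printf '\t')"
    # Decorate: prepend the length of the chosen field, followed by a tab.
    awk -v f="$field" '{ print length($f) "\t" $0 }' |
        # Sort numerically on the prepended length column only.
        sort -t "$tab" -k1,1n |
        # Undecorate: drop the length column, keeping the original line.
        cut -f2-
}
```

It could be combined with the length filter from the other answer, for example: awk 'length($5)>20' /path/to/input | sort_by_field_length 5 > outputfile. Lines with equal lengths end up in whatever order sort's fallback whole-line comparison gives. As far as I know, GNU sort spills to temporary files when the input does not fit in its buffer, so a 2 GB file should not need to fit in memory, but that is worth checking on your system.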