
Sorting a large text file by the length of a column


I have a FASTA (text) file of about 2 GB that I need to sort by the length of its 4th column. It looks like this:

MERCURE:174:C0UT3ACXX:5:2316:18091:100842/1    +    dogpremirnas    4910    AAAAAAAAAA    DDC@BBDDDD    0    3:T>A,9:T>A
MERCURE:174:C0UT3ACXX:5:2316:18110:100902/1    +    dogpremirnas    4909    AAAAAAAAAA    DDDDDBDDBD    0    0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18153:100840/1    -    dogpremirnas    2269    TTTTTTTTTTT    BDDB>9<@A><    0    5:C>T,9:C>T
MERCURE:174:C0UT3ACXX:5:2316:18259:100924/1    +    dogpremirnas    833    ACCGATCTCGTA    CHHFCC8ACBBB    0    6:G>C,7:C>T,8:T>C
MERCURE:174:C0UT3ACXX:5:2316:18344:100886/1    +    dogpremirnas    11734    AAAAAAAAAA    DCDCDDDDDD    0    4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18415:100878/1    +    dogpremirnas    4909    AAAAAAAAAA    BDDCDDDDDB    0    0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18442:100808/1    +    dogpremirnas    11734    AAAAAAAAAA    DDDDDDDDDB    0    4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18461:100754/1    +    dogpremirnas    4914    AAAAAAAAAA    DDDDDDDBDB    0    5:T>A,6:T>A
MERCURE:174:C0UT3ACXX:5:2316:18464:100926/1    +    dogpremirnas    833    ACCGATCTCGTA    HHHFCC/=CBBB    0    6:G>C,7:C>T,8:T>C
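and I need to sort it by the length of that column. The man page for the sort command says I can specify keys, but it does not say how to use a "length" as the key. I only need the lines that have more than 20 symbols in the 4th column. Unfortunately, the software that produced this output (bowtie) does not offer such an option either.

Any suggestions are welcome.
Thanks.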

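I like awk for handling columnar data like this. In the sample above the sequence is the fifth whitespace-separated field, so the filter tests $5: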

awk 'length($5)>20' /path/to/input > outputfile
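To see what the filter does before running it on the full 2 GB file, here is a minimal sketch on a made-up three-line sample. The file names `reads.txt` and `long_reads.txt`, the read names `r1` to `r3`, and the `min_len` variable are all hypothetical; only the `length($5)` test comes from the answer above.

```shell
# Hypothetical sample with the same column layout as the question:
# field 5 holds the sequence (24, 10 and 21 characters long respectively).
printf '%s\n' \
  'r1 + ref 10 ACCGATCTCGTAACCGATCTCGTA HHHH 0 -' \
  'r2 + ref 20 AAAAAAAAAA DDDD 0 -' \
  'r3 - ref 30 ACGTACGTACGTACGTACGTA BBBB 0 -' > reads.txt

# Keep only the lines whose fifth field is longer than min_len characters.
# awk reads the input line by line, so memory use does not grow with file size.
min_len=20
awk -v min="$min_len" 'length($5) > min' reads.txt > long_reads.txt
```

With this sample, `long_reads.txt` should contain the `r1` and `r3` lines only. If the column you care about really is the fourth one, change `$5` to `$4`.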

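You can do this with the usual Linux tools, but if the file exceeds memory you may need something else. Either 1) add an extra column containing the length of the fourth field and sort on that new field, or 2) write your own sort program.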
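Option 1 can be sketched as a decorate, sort, undecorate pipeline. This is only an illustration under assumptions: the sample file `sample.txt` and its contents are made up, and the sketch measures field 5 (the sequence, as in the awk answer) rather than field 4, so adjust the field number to the column you need. Memory is less of a concern than it might seem, because the standard `sort` utility spills to temporary files when the input does not fit in RAM.

```shell
# Hypothetical sample; sequences in field 5 are 12, 4 and 8 characters long.
printf '%s\n' \
  'r1 + ref 10 ACCGATCTCGTA HHHHHHHHHHHH 0 -' \
  'r2 + ref 20 AAAA DDDD 0 -' \
  'r3 - ref 30 TTTTTTTT BBBBBBBB 0 -' > sample.txt

# Decorate: prepend the length of field 5 and a tab to every line.
# Sort:     order numerically on that new first field.
# Undecorate: drop everything up to the first tab, restoring the original line.
awk '{ print length($5) "\t" $0 }' sample.txt \
  | sort -k1,1n \
  | cut -f2- > sorted.txt
```

With this sample, `sorted.txt` lists the lines in the order `r2`, `r3`, `r1`, that is, shortest sequence first. Use `sort -k1,1nr` instead for longest first.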