Sorting a column in alphanumeric order

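I have the following file, and I want to sort it alphanumerically on column 6, so that E1 is followed by I1, then E2, and so on. Before the ":" there is a specific ID. When I run sort -V -k6 on the file, it puts all the ID: entries at the end rather than where they belong. When I run sort -k6 instead, it keeps each ID's Es and Is together, but some IDs belong to a different series (I have highlighted them here). How can I get a sort that makes sure no two IDs are mixed, with the column in this order: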

chr1    259017  259121  104 -   ENSG00000228463:E2
chr1    259122  267095  7973    -   ENSG00000228463:I1
chr1    267096  267253  157 -   ENSG00000228463:E1
chr1    317720  317781  61  +   ENSG00000237094:E1
chr1    317782  320161  2379    +   ENSG00000237094:I1
chr1    320162  320653  491 +   ENSG00000237094:E2
chr1    320654  320880  226 +   ENSG00000237094:I2
chr1    320881  320938  57  +   ENSG00000237094:E3
chr1    320939  321031  92  +   ENSG00000237094:I3
chr1    321032  321290  258 +   ENSG00000237094:E4
chr1    321291  322037  746 +   ENSG00000237094:I4
chr1    322038  322228  190 +   ENSG00000237094:E5
chr1    322229  322671  442 +   ENSG00000237094:I5
chr1    322672  323073  401 +   ENSG00000237094:E6
chr1    323074  323860  786 +   ENSG00000237094:I6
chr1    323861  324060  199 +   ENSG00000237094:E7
chr1    324061  324287  226 +   ENSG00000237094:I7
chr1    324288  324345  57  +   ENSG00000237094:E8
chr1    324346  324438  92  +   ENSG00000237094:I8
chr1    324439  326514  2075    +   ENSG00000237094:E9
**chr1  326096  326569  473 +   ENSG00000250575:E1**
chr1    326515  327551  1036    +   ENSG00000237094:I9
**chr1  326570  327347  777 +   ENSG00000250575:I1**
**chr1  327348  328112  764 +   ENSG00000250575:E2**
chr1    327552  328453  901 +   ENSG00000237094:E10
chr1    328454  329783  1329    +   ENSG00000237094:I10
**chr1  329431  329620  189 -   ENSG00000233653:E2**
**chr1  329621  329949  328 -   ENSG00000233653:I1**
chr1    329784  329976  192 +   ENSG00000237094:E11

Original answer:

sed 's/:[EI]/&_ /' foo.txt |  #separate the number at the end with a space
sort -k6 | sort -n -k7 |         #sort by code, then by [EI] number
sed 's/_ //'                  #remove the underscore space

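I like doing it this way: "protect" the string with a placeholder to isolate the part I care about, then replace the placeholder afterwards.

Close: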

sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'

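But this naively assumes that sort works in a very specific way, and it doesn't... so sometimes E2 comes out before E1.

I'm not sure this can be done with sort alone; awk might be one way to go...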
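The reason the chained pipelines misbehave can be sketched in a few lines of Python. Each sort in a pipeline reorders the whole stream by its own key, so the ordering produced by the previous sort survives at best as a tie-breaker. The rows below are made-up toy values in the shape of the question's sixth column, and the tuple layout is an assumption for illustration only.

```python
# Toy rows: (ID, E/I letter, number). The values are illustrative only.
rows = [
    ("ENSG00000237094", "E", 2),
    ("ENSG00000228463", "I", 1),
    ("ENSG00000237094", "E", 1),
    ("ENSG00000228463", "E", 2),
    ("ENSG00000228463", "E", 1),
    ("ENSG00000237094", "I", 1),
]

# Chained sorts: the second pass reorders everything by number, so the
# grouping by ID from the first pass survives only as a tie-breaker.
chained = sorted(sorted(rows, key=lambda r: r[0]), key=lambda r: r[2])

# One pass with a composite key: ID first, then number, then letter.
composite = sorted(rows, key=lambda r: (r[0], r[2], r[1]))

for row in composite:
    print("{}:{}{}".format(*row))
```

With the chained version the two IDs end up interleaved, while the composite key keeps each ID together and orders its entries E1, I1, E2.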

So I came back to this question and wrote some python code that actually does the job:

#!/usr/bin/env python3

import sys
import re
from collections import defaultdict

# loop through args
for thisarg in sys.argv[1:]:
    # initialize a default dict mapping each ENSG code to its lines
    bysign = defaultdict(list)

    # read the file
    try:
        with open(thisarg, 'r') as thisfile:
            for line in thisfile:
                line = line.strip()
                # skip blank lines
                if not line:
                    continue
                # split each line on whitespace and colons
                dat = re.split(r'[\s:]+', line)
                # append line to dictionary indexed by ENSG code
                bysign[dat[-2]].append(line)
    except IOError:
        print("no such file {}".format(thisarg))
        continue

    # loop over the ENSG codes in sorted order
    for key in sorted(bysign):
        # initialize another, smaller dictionary
        bytuple = dict()
        # loop through all the lines that have the same ENSG code
        for line in bysign[key]:
            # extract the E/I code
            ei = line.split(':')[-1]
            # convert the E/I code to an (int, char) tuple so that
            # E1 sorts before I1, which sorts before E2, and so on
            letter = ei[0]
            number = int(ei[1:])
            # use that tuple to index the smaller dict
            bytuple[(number, letter)] = line
        # print the lines in sorted key order
        for k in sorted(bytuple):
            print(bytuple[k])

I hope you have it figured out by now. I'm curious whether anyone cares to improve my python.
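One possible tightening, offered as a sketch rather than a definitive rewrite: a single sorted() call with a key function does the grouping and the ordering in one step. Unlike indexing a dictionary by the E/I tuple, it also keeps every line when two lines of the same ID share an E/I code. The names FIELD, sort_key and sort_lines are hypothetical, the regular expression assumes the last field always looks like ID:E<number> or ID:I<number> as in the sample, and the key order (ID, number, letter) is taken from the order the question asks for.

```python
import re

# Matches the last field, e.g. "ENSG00000237094:E10" (format assumed from the sample).
FIELD = re.compile(r"(\S+):([EI])(\d+)\s*$")


def sort_key(line):
    """Return (ID, number, letter) for one line; hypothetical helper."""
    match = FIELD.search(line)
    if match is None:
        raise ValueError("unexpected line format: {!r}".format(line))
    gene_id, letter, number = match.groups()
    return (gene_id, int(number), letter)


def sort_lines(lines):
    """Sort non-blank lines by ID, then number, then E before I."""
    stripped = (line.strip() for line in lines)
    return sorted((line for line in stripped if line), key=sort_key)


# A few rows from the question, deliberately out of order.
sample = [
    "chr1    327552  328453  901 +   ENSG00000237094:E10",
    "chr1    259017  259121  104 -   ENSG00000228463:E2",
    "chr1    320162  320653  491 +   ENSG00000237094:E2",
    "chr1    259122  267095  7973    -   ENSG00000228463:I1",
    "chr1    267096  267253  157 -   ENSG00000228463:E1",
]

for line in sort_lines(sample):
    print(line)
```

sort_lines accepts any iterable of strings, so an open file object can be passed to it directly.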

Thanks @scf for the solution. However, when I sorted on k7 it gave me E11 and I11 before E2 and I2, so I made a small change to the command, i.e. sed 's/:[EI]/&|/' file | sort -k6 | sort -V -k7 | sed 's/|. Thanks again.

Actually, it still doesn't work for some lines:

chr10 101380367 101418901 38534 - ENSG0000155287:I1
chr10 101380812 101381675 863 + ENSG0000260475:E1
chr10 101411968 101413581 1613 - ENSG0000229278:I1
chr10 101413582 101413662 80 - ENSG000029278:E1
chr10 101418902 101418994 92 - ENSG0000287:E1

On my version of sort it is sort -n.

Updating the answer now.

I didn't quite get your last comment.

I'm not at my computer, so I don't have the man page to check what -V does. I should have used sort -n in my answer: -n, --numeric-sort  compare according to string numerical value
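The E11-before-E2 symptom mentioned in the comments comes from comparing the codes character by character. A small Python sketch of the difference, using made-up codes and assuming each code is one letter followed by digits:

```python
codes = ["E2", "I11", "E11", "I2", "E1", "I1"]

# Plain string comparison: "E11" sorts before "E2" because "1" < "2".
lexical = sorted(codes)

# Compare the numeric part as an integer first, then the letter.
natural = sorted(codes, key=lambda c: (int(c[1:]), c[0]))

print(lexical)  # ['E1', 'E11', 'E2', 'I1', 'I11', 'I2']
print(natural)  # ['E1', 'I1', 'E2', 'I2', 'E11', 'I11']
```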