Awk: count rows based on a pattern in a column
Tags: awk, libreoffice-calc
I have a dataset like this:
pdbid ch spacegroup ph uniprotacc name
5TUE A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 HALOALKANE DEHALOGENASE
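I want to number each entry based on column 5 (uniprotacc), restarting the count for each new accession. The output should look like this: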
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
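I am not sure how to do it; I think awk or even LibreOffice Calc could do this easily. Any help is much appreciated.
The file is tab-separated. In Calc, put this formula in the second row of column F (the newval column) and drag it down to fill. It compares each row's uniprotacc with the row above, so it assumes rows sharing an accession are adjacent: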
=IF(E2=E1;F1+1;1)
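For comparison, the same previous-row logic can be sketched in awk. This is only an illustration, not part of the original answers: the function name `count_adjacent` and its column argument are hypothetical, and the input is assumed to be tab-separated with a header line. Like the formula, it restarts at 1 whenever the value differs from the row above, even if that value appeared earlier in the file.

```shell
# Hypothetical helper: count_adjacent COL
# Reads TSV on stdin and inserts a "newval" column right after column COL.
# The counter restarts whenever the value in COL differs from the previous row,
# mirroring =IF(E2=E1;F1+1;1).
count_adjacent() {
  awk -v col="$1" 'BEGIN { FS = OFS = "\t" }
    NR == 1 { $col = $col OFS "newval"; print; next }   # header line
    {
      key = $col                                         # value before we modify the field
      n = (NR > 2 && (key "") == (prev "")) ? n + 1 : 1  # string comparison with the row above
      prev = key
      $col = key OFS n                                   # insert the counter after column COL
      print
    }'
}

# Example: printf 'id\tacc\tname\na\tX\tfoo\nb\tX\tfoo\n' | count_adjacent 2
```

If equal accessions can be scattered through the file, an array-based count keyed on the accession avoids the need to sort first.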
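Comments on the question:
Images of data are not helpful to us. Please use a text sample instead.
I copied the data from LibreOffice Calc, but somehow it got turned into an image. The whole dataset is here
@murpholinox, please post data in text format only. We understand your concern that the data is somehow being turned into an image (not sure how), but samples in images are very painful for people reading the question, so for the sake of the question's quality please try to post text only, cheers.
Done. Very sorry.
@murpholinox, no need to be sorry, we are all here to learn, cheers, and thanks again for changing the sample to text.

Here is an awk script solution. The listings that follow are, in order: script.awk, the command to run it, its output, and an annotated copy of the script that explains each step.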
BEGIN {FS = OFS = "\t"}
NR==1 {$NF = "newval" OFS $NF}
NR>1 {$NF = ++seen[$(NF - 1)] OFS $NF}
1
awk -f script.awk input.tsv
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
BEGIN { # preprocessing
    FS = "\t";   # set the input field separator to a tab
    OFS = "\t";  # set the output field separator to a tab
}
NR==1 { # process the first line (the header)
    # $NF is the last field of the input line
    $NF = "newval" OFS $NF; # prefix the last field with "newval" and a tab
}
NR>1 { # process every line after the first
    # $(NF - 1) is the second-to-last field of the input line, e.g. A0A059WYP6
    # seen[$(NF - 1)] is an array element counting the occurrences of $(NF - 1)
    # ++seen[$(NF - 1)] increments that count and yields the new value
    $NF = ++seen[$(NF - 1)] OFS $NF; # prefix the last field with the incremented count and a tab
}
{print} # print every processed line
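
The script above finds the accession as the second-to-last field. If the key column is easier to identify by position, the same `++seen[...]` counting can be sketched with an explicit column number. This is a minimal variant under assumptions, not the answer's own code: the function name `number_by_column` and its argument are hypothetical, and the input is assumed to be tab-separated with a header line and every field present (empty fields still delimited by tabs).

```shell
# Hypothetical helper: number_by_column COL
# Reads TSV on stdin and inserts a "newval" column right after column COL,
# holding a per-value running count of the values seen in COL.
number_by_column() {
  awk -v col="$1" 'BEGIN { FS = OFS = "\t" }
    NR == 1 { $col = $col OFS "newval" }   # header: add the new column name
    NR > 1  {
      n = ++seen[$col]                     # occurrences of this value so far
      $col = $col OFS n                    # insert the count after column COL
    }
    1'                                     # print every (modified) line
}

# Example with the question's layout, where uniprotacc is column 5:
#   number_by_column 5 < input.tsv
```

Because the count is kept in an array keyed by the column value, rows with the same accession do not need to be adjacent.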