Awk: extracting strings that start with specific characters from the same line


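First, my apologies if a similar question has already been answered and I missed it before posting. I have a set of 72 gene annotation files. I would like to extract the GO terms (other annotation terms would be a bonus) in the following format: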

HORVU1Hr1G002090           GO:0003824
HORVU1Hr1G002090           GO:0006527
HORVU1Hr1G002090           GO:0008295
HORVU1Hr1G002090           GO:0008792
HORVU1Hr1G005360           GO:0004497
HORVU1Hr1G005360           GO:0005506
HORVU1Hr1G005360           GO:0016705
HORVU1Hr1G005360           GO:0020037
HORVU1Hr1G005360           GO:0055114
HORVU1Hr1G087600           GO:0009055
HORVU1Hr1G087600           GO:0015035
HORVU1Hr1G087600           GO:0045454
.
.
.
My input file looks like this:

HORVU1Hr1G002090.11 HORVU1Hr1G002090    chr1H:4283580-4286133   HC_G    arginine decarboxylase 1    GO:0003824, GO:0006527, GO:0008295, GO:0008792  PF00278, PF02784    IPR000183, IPR002985, IPR009006, IPR022643, IPR022644, IPR022657, IPR029066 HORVU1Hr1G002090
HORVU1Hr1G005360.1  HORVU1Hr1G005360    chr1H:11579708-11582804 HC_G    Cytochrome P450 superfamily protein GO:0004497, GO:0005506, GO:0016705, GO:0020037, GO:0055114  PF00067 IPR001128, IPR002403, IPR017972    HORVU1Hr1G005360
HORVU1Hr1G087600.1  HORVU1Hr1G087600    chr1H:539679073-539680597   HC_G    Glutaredoxin family protein GO:0009055, GO:0015035, GO:0045454  PF00462 IPR002109, IPR011905, IPR012336 HORVU1Hr1G087600
HORVU1Hr1G087620.1  HORVU1Hr1G087620    chr1H:539699799-539703594   HC_G    S-adenosyl-L-methionine-dependent methyltransferases superfamily protein    none    PF10294 IPR019410, IPR029063 HORVU1Hr1G087620
HORVU1Hr1G089380.1  HORVU1Hr1G089380    chr1H:543801190-543806492   HC_G    Subtilisin-like protease    GO:0004252, GO:0006508  PF00082, PF05922    IPR000209, IPR010259, IPR015500, IPR023828 HORVU1Hr1G089380
HORVU1Hr1G093570.2  HORVU1Hr1G093570    chr1H:553490639-553492292   HC_G    Ribonuclease T2 family protein  GO:0003723, GO:0033897  PF00445 IPR001568 HORVU1Hr1G093570
HORVU1Hr1G093660.11 HORVU1Hr1G093660    chr1H:553651123-553709366   HC_G    ribonuclease 3  GO:0003723, GO:0033897  PF00445 IPR001568 HORVU1Hr1G093660
HORVU1Hr1G094970.1  HORVU1Hr1G094970    chr1H:556830249-556834411   HC_G    Mitochondrial outer membrane protein porin 5    none    none    IPR023614 HORVU1Hr1G094970
HORVU1Hr1G016140.3  HORVU1Hr1G016140    chr1H:49798715-49799683 HC_u    undescribed protein none    none    none HORVU1Hr1G016140
I am fairly sure grep -oP can do this, or better still awk or sed, but I can't get it to work. Please help.

I tried the following, without success:

grep -oP 'GO.*\[,|\|]\K+' input_file
grep -oP 'GO.*+' input_file #prints the whole line after the first match
sed -n 's/.*\(GO:[[:alnum:]]*\).*,/\1/p' input_file
grep -oP 'GO:.*' input_file | tr ',' '|'| tr ' ' '|' | tr '\t' '|'| awk 'BEGIN {FS=OFS="|"}; {for(i=10;i<NF;i++){if($i~/^GO/){a=$i}} print $(NF),a}'

Try using awk as follows:

awk '{for(i=1;i<=NF;i++){if($i~"^GO:"){gsub(",","");print $2,$i}}}' input
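
The one-liner above loops over every whitespace-separated field, keeps those that start with GO:, strips the commas, and prints the gene ID from the second field next to each term. For the bonus annotation terms (the PF and IPR columns), a minimal sketch that generalizes it could look like the following. It makes the same assumptions as the one-liner: fields are split on whitespace and the gene ID is field 2. The function name extract_terms and its prefix argument are hypothetical, introduced here only for illustration. Unlike the one-liner, it strips the trailing comma from a copy of the field rather than from the whole line, and it joins the two columns with a tab.

```shell
#!/bin/sh
# extract_terms PREFIX [FILE...]
# Hypothetical helper (not part of the original answer): prints
# "gene<TAB>term" for every field that starts with PREFIX.
# Assumes whitespace-separated fields with the gene ID in field 2.
# Reads standard input when no FILE is given.
extract_terms() {
    prefix=$1
    shift
    awk -v prefix="$prefix" '
    {
        for (i = 1; i <= NF; i++) {
            term = $i
            sub(/,$/, "", term)      # drop a trailing comma, if present
            # index() returns 1 only when term begins with prefix;
            # this is a literal comparison, not a regex match
            if (index(term, prefix) == 1)
                print $2 "\t" term
        }
    }' "$@"
}
```

Calling extract_terms GO: input gives the gene/GO pairs, while extract_terms PF input or extract_terms IPR input gives the other annotation columns. Because the match is a plain prefix test on every field, a word in the free-text description that happens to start with the prefix would also be printed, so a short prefix such as PF deserves a quick check of the output.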
