如何在bash中使用awk或sed将计算出的输出附加到以制表符分隔的文本文件的行尾? 背景

如何在bash中使用awk或sed将计算出的输出附加到以制表符分隔的文本文件的行尾? 背景,awk,sed,Awk,Sed,我正在使用一个以制表符分隔的文本文件(GENCODE GTF文件),希望计算每行列5和4之间的差异,然后将结果分别附加到每行的末尾 这是2604491行文件的示例: chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level

我正在使用一个以制表符分隔的文本文件(GENCODE GTF文件),希望计算每行列54之间的差异,然后将结果分别附加到每行的末尾

这是2604491行文件的示例:

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
这是给定上述示例的所需输出(滚动到两个代码示例的末尾以查看更改):

我的尝试 以下是我尝试过的:

# this file has a header 5 lines long, so the calculation should start on the 6th line
awk 'NR>=6 {$10=$5-$4} 1' gencode.gtf > output.txt
此代码用第9列开头的差异计算替换了
gene_id
“ensg000022972.5;”“
(我希望它会创建第10列,因为这应该是一个9列文件)。例如:


基于评论的最佳/修订答案:

awk -F"\t" 'BEGIN {OFS="\t"} NR>=6 {$10=$5-$4} 1' gencode.gtf > output.txt

原始答案:

awk -F"\t" 'BEGIN {OFS="\t"} NR>=6 {$10=$5-$4} 1' gencode.gtf > output.txt
事实证明,这是一个简单的修复

有效的代码需要制表符分隔符限定。即:

awk -F"\t" 'NR>=6 {$10=$5-$4} 1' gencode.gtf > output.txt
输出(滚动到代码示例的末尾以查看计算的添加):


但是,我注意到文件的样式确实略有变化。现在,字段之间的空间越来越小,这似乎很奇怪。我不相信它会影响它的功能,但它确实会降低它的可读性。

只需在记录末尾添加您想要的值:

awk -F'\t' 'NR>5{$0=$0 FS ($5-$4)} 1' file

我也欢迎完全不改变文件的替代解决方案。也许有一种更有效的方法可以在不改变文件的情况下进行计算并附加它。您必须将输出字段分隔符
OFS
也设置为选项卡,或者将其替换为空白。这将解决所有问题。谢谢@BenjaminW
awk-F“\t”'BEGIN{OFS=“\t”}NR>=6{$10=$5-$4}1'gencode.gtf>output.txt
很高兴您解决了您的问题,但在将来,请减少数据的大小,以捕获基本问题和任何角落案例的一个示例。我们不需要看到屏幕上滚动的数据来帮助您解决问题。很好的Q/A,因为你包含了你的代码,所以以后只需让你的帖子更紧凑一点;-)(请)。。。。祝你好运。@Sheller,感谢你的反馈!!我会将数据示例的大小编辑得更短。这样就没有那么多杂乱的东西了:)非常优雅。这也很好用。感谢您的贡献@Ed Morton!
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2"; 2540
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 2540
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 358
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 108
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 1188
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2"; 1660
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2"; 47
awk -F'\t' 'NR>5{$0=$0 FS ($5-$4)} 1' file