Unix: How do I merge lines that start with a specific substring?

Tags: unix, awk, printf, substr

I have a file like this:

$ head test
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
                     gene=ENSECAG00000019088
                     note="cytochrome P450 2E1  [Source:RefSeq
                     peptide;Acc:NP_001104773]"
                     gene=ENSECAG00000004229

I want the file to look like this:

ENSECAG00000012421    synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803    Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]

I am not sure the note always spans exactly two lines, so I would like something along the lines of

awk '{if(substr($1,1,4)=="gene") gene=$1; else print gene,$1}'

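but I want it to recognize that the note may span two lines and that there are spaces between the words. So I would like it to print everything inside the "" as column 2 (ideally with the two columns separated by \t, so that they do not get mixed up later). I know how to strip out gene, note, and the " marks, but I was not sure whether they would be useful for identifying the lines. I am happy for this to be a series of separate commands, first putting the whole note on one line and then combining it with the gene, or everything in one go, whichever works best.

Also, if you use awk, could you briefly explain what you are doing?

Thanks for your help!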

This may be overly complicated, but here is one way:

/^\s*gene=/  { gene=substr($1, 6) }
/^\s*note=/  { note=substr($0, 28) }
/"$/         { if (substr($1,1,4)=="note")
                 print gene, substr($0, 28, length($0)-28);
               else
                 print gene, note, substr($0, 22, length($0)-22) }

Note that this handles both one-line and two-line notes.
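The hard-coded column positions (28 and 22) tie the script above to this exact indentation. As a rough sketch only, and not part of the original answer, the same idea can be written without fixed offsets by stripping the leading whitespace first and collecting note lines until the closing quote. The function name merge_gene_notes and the in_note flag are made up for illustration; the sketch assumes every note starts with note=" and ends with a " at the end of a line, and it silently skips genes that have no note.

```shell
# Hypothetical helper: print "gene<TAB>note" for every gene that has a note.
# Uses only POSIX awk features (no \s, no regex RS).
merge_gene_notes() {
  awk '
    {
      sub(/^[ \t]+/, "")                      # drop leading indentation
      sub(/[ \t]+$/, "")                      # drop trailing blanks
      if (/^gene=/) {                         # remember the ID after "gene="
        gene = substr($0, 6)
        next
      }
      if (/^note="/) {                        # first line of a note
        note = substr($0, 7)
        in_note = 1
      } else if (in_note) {                   # continuation line of a note
        note = note " " $0
      } else {
        next                                  # anything else is ignored
      }
      if (note ~ /"$/) {                      # closing quote reached
        sub(/"$/, "", note)                   # strip the closing quote
        gsub(/[ \t]+/, " ", note)             # compress runs of blanks
        print gene "\t" note
        in_note = 0
      }
    }
  ' "$@"
}
```

Called as merge_gene_notes test, this should print one tab-separated line per gene that has a note, regardless of how many lines the note spans.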

Using awk:

awk 'BEGIN{FS="\n";RS="gene="}{gsub(/(note=|\")/,"");print $1,$2,$3}' file|awk '$1=$1'

This gives:

ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229
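The $1=$1 idiom used in this pipeline is easy to miss, so a short illustration may help. Assigning to any field makes awk rebuild $0 from its fields joined by OFS (a single space by default), which collapses runs of whitespace and trims both ends. The function name squeeze_spaces is made up for this sketch. When $1=$1 is used as a bare pattern, as in the pipeline, lines whose first field is empty or 0 are not printed; the sketch prints every line instead.

```shell
# Hypothetical helper: normalize whitespace on every input line.
squeeze_spaces() {
  # Assigning $1 to itself forces awk to rebuild $0 using OFS (one space).
  awk '{ $1 = $1; print }' "$@"
}
```

For example, piping a line with irregular spacing through squeeze_spaces should yield the same words separated by single spaces.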

If you have GNU awk or mawk (the solution relies on a regex-based input record separator, which strictly POSIX-compliant or older awk implementations do not support):

Short version:

awk -v RS=' *(gene=|note="|")' '
  { gsub("\n", ""); if ($0 == "") next; $1=$1; 
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n") }
  ' file

Annotated version:

-v RS=' *(gene=|note="|")'

RS is a special variable that defines the input record separator. Here it is set to a regular expression that splits the input, across line boundaries, into the records of interest.

awk -v RS=' *(gene=|note="|")' '
  {
    gsub("\n", "")       # remove all newlines from the record
    if ($0 == "") next   # ignore empty records
    $1 = $1              # rebuild the record to compress runs of interior spaces
    # Output:
    #  - Is it a gene record, i.e. a single field that contains a gene name?
    #    Print it with a trailing \t but no trailing \n, so that the next
    #    note record prints on the same line.
    #  - Otherwise it is a note record: print it with a trailing \n, effectively
    #    appending it to the previous gene record.
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n")
  }
  ' file
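The output step is the part that actually glues a gene to its note, so it may help to see it in isolation. The sketch below uses only POSIX awk features and works on plain lines rather than regex-separated records; the function name pair_lines and the ID[0-9]+ pattern are placeholders invented for this illustration, standing in for the ENSECAG test.

```shell
# Hypothetical helper: join each ID line with the line that follows it.
pair_lines() {
  awk '{
    # A line consisting only of an ID is followed by a tab (no newline),
    # so the next line lands on the same output line; any other line
    # is followed by a newline, which ends the output line.
    printf "%s%s", $0, (/^ID[0-9]+$/ ? "\t" : "\n")
  }' "$@"
}
```

Feeding it alternating ID and text lines should produce one tab-separated line per pair.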

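Close, but the OP wants \t to separate the gene name from the note (i.e., just two output columns, separated by \t).

Requires gawk or mawk, due to the use of a multi-character RS.

This works with the specific input (except that the OP wants \t rather than a space after the first output token), but it does not generalize well; you could make it more portable by using [[:space:]] instead of \s.

@mklement0, your answer (which I like a lot) is also tailored to the "specific input" (e.g., it relies on the words "note" and "gene").

Thanks (I had hoped my answer would work with all awk implementations, but it does not); I was referring to the hard-coded character positions in your answer.

Thanks for your answer and the explanation! Your code works very well for notes that span two or more lines, but it gets confused when note="processed_pseudogene". For example, /gene=ENSECAG00000005298 /note="processed_pseudogene" /gene=ENSECAG00000026864 /note="cyclic nucleotide binding domain containing 1 [Source:HGNC Symbol;Acc:26663]" turns into: ENSECAG00000005298 ENSECAG00000026864 cyclic nucleotide binding domain containing 1 [Source:HGNC Symbol;Acc:26663] /gene=ENSECAG00000026236 /note="U6 spliceosomal RNA [Source:RFAM;Acc:RF00026]"; that is, only 2 lines were processed correctly.

By definition, any record containing only 1 field was treated as a gene line. The solution is to examine the record's content to determine whether it is a gene name. I do not know how specific this test can or should be, but I have updated my answer to replace the overly broad test (NF==1) with a specific, regex-based one. Note: the original non-empty-line test, /\S/, did not work with single-line note records in older gawk and mawk versions, so I had to rewrite the awk program slightly; hopefully it works for you now.

Thanks! There was one more bug, but that was my fault: I had not realized that my /note can also span 3 lines. Once my test file was correct, your command worked perfectly for 3 lines too! Also, all my gene names start with ENSECAG followed by digits, so the test is broad enough for my case. Cheers

@user3400409: My pleasure; I'm glad we got this solved in the end.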
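To make the difference between the two tests concrete, here is a small sketch (the function names classify_by_nf and classify_by_regex are invented for this illustration). It assumes the records have already been split and cleaned, one per line, and only labels each one; a one-word note such as processed_pseudogene is where the two tests disagree.

```shell
# Hypothetical helpers: label each input line as "gene" or "note".

# Overly broad test: any single-field record counts as a gene.
classify_by_nf() {
  awk '{ print (NF == 1 ? "gene" : "note") }' "$@"
}

# Content-based test: only ENSECAG followed by digits counts as a gene.
classify_by_regex() {
  awk '{ print (/^ENSECAG[0-9]+$/ ? "gene" : "note") }' "$@"
}
```

With the NF-based test a one-word note is labeled as a gene, so it gets a trailing tab instead of a newline and the next gene is appended to the same line, which matches the broken output described in the comments.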
