Bash: merge multiple TSV files by the first column
I have many (dozens of) TSV files, each with just two columns, in a directory, and I want to merge them all on the value of the first column (both columns have headers that I need to keep): when a key already exists, the value from that file's second column must be added as a new column, and so on (see the example). The files may have different numbers of rows and are not sorted by the first column, although that is easily fixed with sort.

I tried join, but that only works on two files. Can join be extended to all the files in a directory? I suspect awk might be a better fit, but my awk knowledge is very limited. Any ideas?

Here is an example with just three files:
S01.tsv
Accesion S01
AJ863320 1
AM930424 1
AY664038 2
S02.tsv
Accesion S02
AJ863320 2
AM930424 1
EU236327 1
EU434346 2
S03.tsv
Accesion S03
AJ863320 5
EU236327 2
EU434346 2
The output file should be:
Accesion S01 S02 S03
AJ863320 1 2 5
AM930424 1 1
AY664038 2
EU236327 1 2
EU434346 2 2
OK, thanks to James Brown I got this code working (I named it compile.awk), but with a small glitch:
BEGIN { OFS="\t" }                            # tab separated columns
FNR==1 { f++ }                                # counter of files
{
    a[0][$1]=$1                               # reset the key for every record
    for(i=2;i<=NF;i++)                        # for each non-key element
        a[f][$1]=a[f][$1] $i ( i==NF?"":OFS ) # combine them to array element
}
END {                                         # in the end
    for(i in a[0])                            # go thru every key
        for(j=0;j<=f;j++)                     # and all related array elements
            printf "%s%s", a[j][i], (j==f?ORS:OFS)
}                                             # output them, nonexistent will output empty
The output I get is:
LN854586.1.1236 1
JF128382.1.1303 1
Accesion S01 S02 S03
JN233077.1.1420 1
HQ836180.1.1388 1
KP718814.1.1338 1
JQ781640.1.1200 2
The first two lines do not belong there, because the output should start with the combined header of all the files (the third line above).
Is there a way to fix this?

I would probably tackle it like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @header;
my %all_rows;
my %seen_cols;

#read STDIN or files specified as args.
while ( <> ) {
    #detect a header row by keyword.
    #can probably do this after 'open' but this way
    #means we can use <> and an arbitrary file list.
    if ( m/^Accesion/ ) {
        @header = split;
        shift @header;    #drop "Accesion" off the list so it's just S01, S02, S03 etc.
        $seen_cols{$_}++ for @header;    #keep track of uniques.
    }
    else {
        #not a header row - split the row on whitespace.
        #can do /\t/ if that's not good enough, but it looks like it should be.
        my ( $ID, @fields ) = split;
        #use a hash slice to populate the row.
        my %this_row;
        @this_row{@header} = @fields;
        #debugging
        print Dumper \%this_row;
        #push each field onto the all-rows hash.
        foreach my $column ( @header ) {
            #append to the field, in case there are duplicates (no overwriting).
            $all_rows{$ID}{$column} .= $this_row{$column};
        }
    }
}

#print for debugging
print Dumper \%all_rows;
print Dumper \%seen_cols;

#grab the list of column headings we've seen, and order them.
my @cols_to_print = sort keys %seen_cols;

#print the header row.
print join( "\t", "Accesion", @cols_to_print ), "\n";

#iterate the keys, and slice.
foreach my $key ( sort keys %all_rows ) {
    #print one row at a time.
    #map iterates all the columns, and gives the value or an empty string
    #if it's undefined (prevents errors).
    print join( "\t", $key, map { $all_rows{$key}{$_} // '' } @cols_to_print ), "\n";
}
Can you show (in the question) what you have tried so far?

Basically join, some grep attempts, and a lot of searching for similar problems, but nothing I could implement or adapt, probably because of my lack of coding knowledge. join does exactly what I want, but only works on two files.

You can use program.awk from the link. Adjust OFS to your needs (OFS="\t", I assume). Note that the output record order is random.

With three files the join chain would be: join -a1 -a2 -e "" -o 0,1.2,2.2 S01.tsv S02.tsv | join -a1 -a2 -e "" -o 0,1.2,1.3,2.2 - S03.tsv

It does not output in order; you either need to sort the output, see awk -f compile.awk S*.tsv | sort, or control the scanning of the keys. Also, if your first field is too long for a single tab to line things up, you will need printf.
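A runnable sketch of that join chain, with two hedged assumptions: GNU join is used so that --nocheck-order (a GNU coreutils extension) can tolerate the Accesion header row, which sorts after keys like AJ863320, and -t with a literal tab keeps empty fields from collapsing when the intermediate result is re-split. The example files are recreated inline.

```shell
# Recreate the example files (tab-separated; header first, data rows sorted).
printf 'Accesion\tS01\nAJ863320\t1\nAM930424\t1\nAY664038\t2\n'              > S01.tsv
printf 'Accesion\tS02\nAJ863320\t2\nAM930424\t1\nEU236327\t1\nEU434346\t2\n' > S02.tsv
printf 'Accesion\tS03\nAJ863320\t5\nEU236327\t2\nEU434346\t2\n'              > S03.tsv

TAB=$(printf '\t')

# First join pairs S01 and S02; the second join adds S03 onto the result.
# -a1 -a2 keeps unmatched lines from both sides; -e '' fills missing fields.
join -t "$TAB" --nocheck-order -a1 -a2 -e '' -o 0,1.2,2.2 S01.tsv S02.tsv |
  join -t "$TAB" --nocheck-order -a1 -a2 -e '' -o 0,1.2,1.3,2.2 - S03.tsv > joined.tsv

cat joined.tsv
```

This works because every file keeps the same relative order (header first, then sorted keys), so the pairwise merge lines up even though the header technically violates sort order.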
This Perl script works like a charm! I will just comment out the debugging print lines for the final version, thanks.