Bash: merge multiple TSV files by the first column
I have many (dozens of) TSV files, each with just two columns, in a directory, and I want to merge them all on the value of the first column (both columns have headers that I need to keep): when a key already exists, the value from that file's second column must be added as a new column, and so on (see the example). The files may have different numbers of rows and are not sorted by the first column, although that is easily fixed with sort.

I tried join, but that only works on two files. Can join be extended to all the files in a directory? I suspect awk might be a better fit, but my awk knowledge is very limited. Any ideas?

Here is an example with just three files:
S01.tsv
Accesion S01
AJ863320 1
AM930424 1
AY664038 2
S02.tsv
Accesion S02
AJ863320 2
AM930424 1
EU236327 1
EU434346 2
S03.tsv
Accesion S03
AJ863320 5
EU236327 2
EU434346 2
The output file should be:
Accesion S01 S02 S03
AJ863320 1 2 5
AM930424 1 1
AY664038 2
EU236327 1 2
EU434346 2 2
OK, thanks to James Brown I got this code working (I named it compile.awk), but with a small glitch:
BEGIN { OFS="\t" }                            # tab separated columns
FNR==1 { f++ }                                # counter of files
{
    a[0][$1]=$1                               # reset the key for every record
    for(i=2;i<=NF;i++)                        # for each non-key element
        a[f][$1]=a[f][$1] $i ( i==NF?"":OFS ) # combine them to array element
}
END {                                         # in the end
    for(i in a[0])                            # go thru every key
        for(j=0;j<=f;j++)                     # and all related array elements
            printf "%s%s", a[j][i], (j==f?ORS:OFS)
}                                             # output them, nonexistent will output empty
The output I get is:
LN854586.1.1236 1
JF128382.1.1303 1
Accesion S01 S02 S03
JN233077.1.1420 1
HQ836180.1.1388 1
KP718814.1.1338 1
JQ781640.1.1200 2
The first two lines do not belong there, because the output should start with the combined header of all the files (the third line above).
Is there a way to fix this?

I would probably tackle it like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @header;
my %all_rows;
my %seen_cols;

#read STDIN or files specified as args.
while ( <> ) {
    #detect a header row by keyword.
    #can probably do this after 'open' but this way
    #means we can use <> and an arbitrary file list.
    if ( m/^Accesion/ ) {
        @header = split;
        shift @header;    #drop "Accesion" off the list so it's just S01, S02, S03 etc.
        $seen_cols{$_}++ for @header;    #keep track of uniques.
    }
    else {
        #not a header row - split the row on whitespace.
        #can do /\t/ if that's not good enough, but it looks like it should be.
        my ( $ID, @fields ) = split;
        #use a hash slice to populate the row.
        my %this_row;
        @this_row{@header} = @fields;
        #debugging
        print Dumper \%this_row;
        #push each field onto the all-rows hash.
        foreach my $column ( @header ) {
            #append to the field, in case there are duplicates (no overwriting).
            $all_rows{$ID}{$column} .= $this_row{$column};
        }
    }
}

#print for debugging
print Dumper \%all_rows;
print Dumper \%seen_cols;

#grab the list of column headings we've seen, and order them.
my @cols_to_print = sort keys %seen_cols;

#print the header row.
print join( "\t", "Accesion", @cols_to_print ), "\n";

#iterate the keys, and slice.
foreach my $key ( sort keys %all_rows ) {
    #print one row at a time.
    #map iterates all the columns, and gives the value or an empty string
    #if it's undefined (prevents errors).
    print join( "\t", $key, map { $all_rows{$key}{$_} // '' } @cols_to_print ), "\n";
}
Can you show (in the question) what you have tried so far?

Basically join, some grep attempts, and a lot of searching for similar problems, but nothing I could implement or adapt, probably because of my lack of coding knowledge. join does exactly what I want, but only works on two files.

You can use program.awk from the link. Adjust OFS to your needs (OFS="\t", I assume). Note that the output record order is random.

With three files the join chain would be: join -a1 -a2 -e "" -o 0,1.2,2.2 S01.tsv S02.tsv | join -a1 -a2 -e "" -o 0,1.2,1.3,2.2 - S03.tsv

It does not output in order; you either need to sort the output, see awk -f compile.awk S*.tsv | sort, or control the scanning of the keys. Also, if your first field is too long for a single tab to line things up, you will need printf.
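A runnable sketch of that join chain, with two hedged assumptions: GNU join is used so that --nocheck-order (a GNU coreutils extension) can tolerate the Accesion header row, which sorts after keys like AJ863320, and -t with a literal tab keeps empty fields from collapsing when the intermediate result is re-split. The example files are recreated inline.

```shell
# Recreate the example files (tab-separated; header first, data rows sorted).
printf 'Accesion\tS01\nAJ863320\t1\nAM930424\t1\nAY664038\t2\n'              > S01.tsv
printf 'Accesion\tS02\nAJ863320\t2\nAM930424\t1\nEU236327\t1\nEU434346\t2\n' > S02.tsv
printf 'Accesion\tS03\nAJ863320\t5\nEU236327\t2\nEU434346\t2\n'              > S03.tsv

TAB=$(printf '\t')

# First join pairs S01 and S02; the second join adds S03 onto the result.
# -a1 -a2 keeps unmatched lines from both sides; -e '' fills missing fields.
join -t "$TAB" --nocheck-order -a1 -a2 -e '' -o 0,1.2,2.2 S01.tsv S02.tsv |
  join -t "$TAB" --nocheck-order -a1 -a2 -e '' -o 0,1.2,1.3,2.2 - S03.tsv > joined.tsv

cat joined.tsv
```

This works because every file keeps the same relative order (header first, then sorted keys), so the pairwise merge lines up even though the header technically violates sort order.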
This Perl script works like a charm! I will just comment out the debugging print lines for the final version, thanks.