Bash: rearranging the data in a file (not a direct transpose)


I have a file like this (more than 2.5k lines):

NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

As you can see, each name is repeated 4 times (4 rows) to "store" the values for 4 years, and each year has 3 values (A, B and C).

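I need to rearrange the data so that each name appears on only one line. The 4 years that originally appear as rows must therefore appear as new columns, like this: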

NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Alternatively, this would also be an acceptable output format:

NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C

I am not sure which output format is easier to produce, but either one is fine.

As far as I can tell, this is not a "direct transpose", and I did not find any similar question, which is why I am asking again with more details.

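My suggestion uses Python (sed is clearly not up to this task; awk might be, but it would be a challenge). I have hard-coded the 4*3 shape of the "matrix". Something more elegant is probably possible: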

import collections

nb_year = 4
d = collections.defaultdict(lambda: [None]*nb_year*3)    


with open("input_file") as infile:

    next(infile)  # skip title

    for l in infile:  # read line by line
        fields = l.strip().split()  # extract blank-separated fields
        if len(fields)<3: continue  # protection against "accidental" blank lines
        target = d[fields[0]]       # name
        offset = int(fields[1][1])-1    # extract year index 1 to 4
        for i,f in enumerate(fields[2:]):  # interleaved matrix fill
            target[offset+i*nb_year] = f      # fill "matrix"

    print("NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4")
    for k,v in sorted(d.items()):
        print("{} {}".format(k," ".join(v)))
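
The hard-coded 4*3 shape and header could probably be avoided by reading both from the input. The following is only a sketch under assumptions, not the answer's original code: the function name pivot and the "-" placeholder for missing cells are made up for illustration, and the first line is assumed to be the header.

```python
def pivot(lines):
    """Return output rows in variable-major order: A/Y1 A/Y2 ... C/Y4."""
    rows = [l.split() for l in lines if l.strip()]   # ignore blank lines
    header, data = rows[0], rows[1:]
    variables = header[2:]        # value columns, e.g. ['A', 'B', 'C']
    years = []                    # year labels in order of first appearance
    table = {}                    # name -> {(variable, year): value}
    for name, year, *values in data:
        if year not in years:
            years.append(year)
        cells = table.setdefault(name, {})
        for var, val in zip(variables, values):
            cells[(var, year)] = val
    out = [[header[0]] + ["{}/{}".format(v, y) for v in variables for y in years]]
    for name, cells in table.items():
        # "-" is a hypothetical placeholder for a missing (variable, year) cell
        out.append([name] + [cells.get((v, y), "-") for v in variables for y in years])
    return out


sample = """NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00""".splitlines()

for row in pivot(sample):
    print(" ".join(row))
# NAME A/Y1 A/Y2 B/Y1 B/Y2 C/Y1 C/Y2
# JOHN 10,00 11,00 19,00 23,00 65,00 64,00
# PAUL 14,00 15,00 43,00 90,00 23,00 34,00
```

Names are printed in order of first appearance here rather than sorted, which relies on dict insertion order (Python 3.7+).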

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
NR==1 { split($0,hdr); next }
{
    idx = (NR-2)%4+1
    val[idx][0]
    split($0,val[idx])
}
NR==5 {
    printf "%s", hdr[1]
    for (j=3; j in hdr; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, hdr[j]"/"val[i][2]
        }
    }
    print ""
}
idx==4 {
    printf "%s", $1
    for (j=3; j<=NF; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, val[i][j]
        }
    }
    print ""
}

$ awk -f tst.awk file
NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

An awk solution:

$ cat script.awk
#!/bin/awk

{
    if( length($1) > 0 )
    {
        if( prev != $1 )
        {
            str = ""
            n = 0
        }

        str = str FS $0

        n = n + 1

        if( n == 4 )
        {
            split( str, a, FS )
            print a[1],a[3],a[8],a[13],a[18],a[4],a[9],a[14],a[19],a[5],a[10],a[15],a[20]
        }

        prev = $1
    }

}

# eof #

Test:

$ awk -f script.awk -- input.txt 
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00
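
The index list a[3],a[8],a[13],... in script.awk follows a simple rule: after the 4 rows of 5 fields are concatenated, field j of row y (counting rows from 0) sits at position y*5 + j. A small Python sketch of that arithmetic, with a hypothetical helper name and parameters, could look like this:

```python
def positions(n_years=4, n_fields=5, first_value=3):
    """1-based positions in the concatenated record, one value column at a time."""
    return [y * n_fields + j
            for j in range(first_value, n_fields + 1)   # columns A, B, C
            for y in range(n_years)]                     # rows Y1..Y4


print(positions())
# [3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20]
```

Changing n_years or n_fields gives the list to use for a differently shaped input, which is the part the awk script hard-codes.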
Hope this helps.

$ cat foo.awk
NR==1{next}                                              # skip the header
{
    printf "%s", (b!=$1?(b==""?"":ORS) $1:"") OFS; b=$1  # print name or OFS
} 
{
    printf "%s", $3 OFS $4 OFS $5                        # print fields
} 
END {print ""}                                           # finish with ORS

Run it:

$ awk -f foo.awk foo.txt
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00
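
For comparison, the same year-major concatenation can be sketched in Python with itertools.groupby. Like foo.awk, this assumes that all rows for a name are contiguous in the input; the function name is made up for illustration.

```python
from itertools import groupby


def collapse_by_name(lines):
    """Yield one output line per name, values in Y1/A Y1/B Y1/C Y2/A ... order."""
    rows = (l.split() for l in lines[1:] if l.strip())   # skip the header line
    for name, group in groupby(rows, key=lambda r: r[0]):
        # append the value fields of each consecutive row for this name
        values = [v for r in group for v in r[2:]]
        yield " ".join([name] + values)


sample = """NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00""".splitlines()

for line in collapse_by_name(sample):
    print(line)
# JOHN 10,00 19,00 65,00 11,00 23,00 64,00
# PAUL 14,00 43,00 23,00 15,00 90,00 34,00
```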

Using the collapse operation of datamash, this is an almost complete solution (the header line is done by hand):

echo \
  "NAME    A/Y1  A/Y2  A/Y3  A/Y4  B/Y1  B/Y2  B/Y3  B/Y4  C/Y1  C/Y2  C/Y3  C/Y4"
tr ',' '.' < input.txt | \
datamash --header-in -W -g1 collapse A collapse B collapse C | \
tr '[.,]' '[, ]'

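Note: the tr steps are there because collapse uses a comma as its output separator. To keep those separators apart from the decimal commas, the decimal commas are first changed to dots, and afterwards the dots are changed back to commas while the separator commas become spaces.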
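
The round trip can be illustrated in Python. This is only a sketch: the collapse function below is a stand-in that mimics the comma-joined output described above, not a call into datamash.

```python
def collapse(values):
    # stand-in for the collapse operation: join grouped values with commas
    return ",".join(values)


raw = ["10,00", "11,00", "12,00", "13,00"]

dotted = [v.replace(",", ".") for v in raw]    # first tr: ',' -> '.'
joined = collapse(dotted)                       # '10.00,11.00,12.00,13.00'

# second tr: '.' -> ',' and ',' -> ' ' in a single pass
restored = joined.translate(str.maketrans(".,", ", "))
print(restored)
# 10,00 11,00 12,00 13,00
```

Without the first step, the joined string would be 10,00,11,00,... and the decimal commas could no longer be told apart from the separators.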

If needed, the header can be generated from input.txt with code, but that is longer and uglier than the simple hard-coded echo.
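
As one possible way to generate that header, here is a Python sketch. It is not the answer's original code; the function name is hypothetical, and it assumes the first line is the header and the second field of every data row is the year label.

```python
def build_header(lines):
    """Build 'NAME A/Y1 A/Y2 ...' from the header line and the year labels seen."""
    rows = [l.split() for l in lines if l.strip()]
    key, _year_label, *variables = rows[0]
    # dict.fromkeys keeps the first occurrence of each year, in input order
    years = list(dict.fromkeys(r[1] for r in rows[1:]))
    return " ".join([key] + ["{}/{}".format(v, y) for v in variables for y in years])


sample = """NAME YEAR A B
JOHN Y1 10,00 19,00
JOHN Y2 11,00 23,00
PAUL Y1 14,00 43,00
PAUL Y2 15,00 90,00""".splitlines()

print(build_header(sample))
# NAME A/Y1 A/Y2 B/Y1 B/Y2
```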


With perl, generic in the sense that the number of years and the number of columns can vary:

$ cat ip.txt 
NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

Assuming it is good enough to sort the names when printing the output:

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip.txt
NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C 
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00

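Asking members to hand you an answer is discouraged. This site is meant to help people refine or correct their own attempts.

There are no blank lines between the input/output rows; sorry, that was my mistake when pasting the data. Data edited to remove the nonexistent blank lines.

I can't imagine why you would think this is a challenge for awk, but anyway - more important than the hard-coded matrix, it is worth mentioning that you are printing a hard-coded header line instead of reading it from the input file!

It is a challenge for me + awk :) You are right, this could be more general.

You also missed a detail. All the A values come first, then B and C. You cannot simply concatenate the values…

@Sundeep Also, an acceptable output format could be: NAME Y1/A Y1/B Y1/C Y2/A…

Oh! That does make the problem easier to solve! It seems I deleted my other answer by mistake! :-/ :-/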

The same perl command on a smaller input, with two years and two value columns:

$ cat ip1.txt 
NAME YEAR A B
JOHN Y1 10,00 19,00
JOHN Y2 11,00 23,00
PAUL Y1 14,00 43,00
PAUL Y2 15,00 90,00

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip1.txt
NAME Y1/A Y1/B Y2/A Y2/B 
JOHN 10,00 19,00 11,00 23,00
PAUL 14,00 43,00 15,00 90,00