CSV / Awk 4.1.4 error when processing a large file
I am using Awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB of RAM to convert a row-wise CSV file into a column-wise CSV file, grouped by the last column (Sample_Key). Here is a small row-wise CSV example:
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is the correct output for the small CSV example above:
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
And here is the awk code that performs the row-wise to column-wise conversion:
BEGIN {
    printf "Probe_Key,ind_beta,Sample_Key\n";
}
NR > 1 {
    ks[$3 $1] = $2;  # save the second column using the first and third as index
    k1[$1]++;        # save the first column
    k2[$3]++;        # save the third column
}
END {
    # After processing input
    for (i in k2)                        # loop over third column
    {
        printf "%s,", i;                 # print it as first value in the row
        for (j in k1)                    # loop over the first column (index)
        {
            if (j < length(k1))
            {
                printf "%s,", ks[i j];   # and print values ks[third_col first_col]
            }
            else
                printf "%s", ks[i j];    # print last value
        }
        print "";                        # newline
    }
}
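One subtlety in this script: awk's "for (j in k1)" visits array indices in an unspecified order, so both the output column order and the "j < length(k1)" last-column test depend on it. In gawk specifically, PROCINFO["sorted_in"] can force a defined order (length(array) is also a gawk extension). A minimal sketch:

```shell
# gawk-only: PROCINFO["sorted_in"] fixes the iteration order of "for (i in a)"
# to ascending numeric index; without it the visiting order is unspecified.
gawk 'BEGIN {
  PROCINFO["sorted_in"] = "@ind_num_asc"
  a[3] = "c"; a[1] = "a"; a[2] = "b"
  out = ""
  for (i in a) out = out a[i]
  print out                      # prints "abc", in index order
}'
```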
When I run this on my largest row-wise CSV file (126 GB), I get the following error:
ERROR (EXIT CODE 255) Unknow error code
How can I debug this, given that the code works for smaller input sizes?

If your data is already grouped on field 3 and sorted on field 1, you can simply do:
$ awk -F, 'NR==1 {next}                          # skip the header line
    {if (p != $3)                                # a new Sample_Key group begins
       {if (p) print v; v = $3 FS $2; p = $3}    # flush previous row, start new one
     else v = v FS $2}                           # same group: append this beta
    END {print v}' file                          # flush the final group
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
If not, it is better to pre-sort first rather than cache all the data in memory, which is what blows up on the large input file.
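To see why caching fails, a rough back-of-the-envelope helps: the original script stores one ks[] entry per input row. The figures below (~20 bytes per input line, ~100 bytes of gawk array overhead per stored element) are illustrative guesses, not measurements:

```shell
# Rough estimate of the memory the ks[] array would need for a 126 GB file.
# Both the bytes-per-line and per-element overhead are assumed values.
awk 'BEGIN {
  gb   = 1024^3
  rows = 126 * gb / 20                     # ~6.8e9 rows in a 126 GB file
  printf "rows ~ %.1e, estimated memory ~ %.0f GB\n", rows, rows * 100 / gb
}'
# rows ~ 6.8e+09, estimated memory ~ 630 GB
```

Even with these generous guesses, the estimate already exceeds the 250 GB of RAM available.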
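When the input is not already grouped, the pre-sort can be done externally before the streaming awk one-liner; GNU sort merges via temporary files on disk, so memory stays bounded. A self-contained sketch (the file name and tiny sample are placeholders):

```shell
# Build a tiny ungrouped sample, sort it by Sample_Key (field 3) then by
# Probe_Key (field 1), and stream it through the grouping one-liner.
printf '%s\n' 'Probe_Key,Ind_Beta,Sample_Key' \
  '2,0.9431,7417' '1,0.6253,7387' '1,0.6277,7417' '2,0.9495,7387' > input.csv

tail -n +2 input.csv \
  | sort -t, -k3,3n -k1,1n \
  | awk -F, '{ if (p != $3) { if (p) print v; v = $3 FS $2; p = $3 }
               else v = v FS $2 }
             END { print v }'
# prints:
# 7387,0.6253,0.9495
# 7417,0.6277,0.9431
```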
Rather than trying to hold all 5 GB (or 126 GB) of data in memory at once and print it all out at the end, here is an approach using sort together with GNU datamash to group each set of values as they stream through:
$ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
If you can get rid of that header line, so that the file can be passed directly to sort instead of through a pipe, it may be able to pick a more efficient sorting method, knowing the full size in advance.

Could you please add comments to the code? Really appreciate the help.
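A sketch of that header-free variant (assuming GNU datamash is installed; the file names are placeholders): strip the header once, let sort read a plain file so it can see the full size up front, then collapse each group of adjacent Sample_Key rows:

```shell
# Tiny sample input (placeholder for the real 126 GB file).
printf '%s\n' 'Probe_Key,Ind_Beta,Sample_Key' \
  '1,0.6277,7417' '1,0.6253,7387' '2,0.9495,7387' '2,0.9431,7417' > input.csv

tail -n +2 input.csv > noheader.csv      # strip the header once
# datamash -g (without -s) expects its groups to arrive pre-sorted,
# which the sort step guarantees here.
sort -t, -k3,3n -k1,1n noheader.csv | datamash -t, -g3 collapse 2
# prints:
# 7387,0.6253,0.9495
# 7417,0.6277,0.9431
```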