Bash: splitting a huge CSV file

bash csv unix

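I have a huge CSV file, about 20 GB. It has 5000 columns and 2,500,000 rows. I want to write each of its columns to a separate file. I have already tried the following, but it is very slow. My code is below: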

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
for i in `seq 1 $Columns`;
do
    echo $i
    tail -n +2 train.csv | cut -d',' -f$i > cols/col_$i.txt
done

I would welcome any suggestions for speeding this up.

Here is a bash script that does the job in a single pass:

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
tail -n +2 train.csv | \
    while IFS=, read -ra row; do
        for i in `seq 1 $Columns`; do
            echo "${row[$(($i-1))]}" >> cols/col_$i.txt
        done 
    done

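The drawback of this script is that it opens and closes the column files millions of times. The following Perl script avoids that problem by keeping all the files open: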

#!/usr/bin/perl
use strict;
use warnings;

my @handles;
open my $fh,'<','train.csv' or die;
<$fh>; #skip the header
while (<$fh>) {
    chomp;
    my @values=split /,/;
    for (my $i=0; $i<@values; $i++) {
        if (!defined $handles[$i]) {
            open $handles[$i],'>','cols/col_'.($i+1).'.txt' or die;
        }
        print {$handles[$i]} "$values[$i]\n";
    }
}
close $fh;
close $_ for @handles;

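Since you have 5000 columns and this script keeps 5001 files open, you will need to increase the number of open file descriptors that your system allows.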
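
As a rough sketch of how that limit can be inspected and raised from the shell before running the script: this assumes a bash-like shell with the `ulimit` builtin, and the target value of 8192 is only an illustrative choice, not something from the answer. Without root privileges the soft limit can be raised only up to the hard limit.

```shell
#!/usr/bin/env bash
# Show the current soft and hard limits on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft limit: $soft, hard limit: $hard"

# Hypothetical target: 5000 column files plus some headroom.
wanted=8192

# Try to raise the soft limit for this shell and its child processes.
# This fails if $wanted exceeds the hard limit and we are not root.
if ulimit -Sn "$wanted" 2>/dev/null; then
    echo "soft limit is now $(ulimit -Sn)"
else
    echo "could not raise soft limit to $wanted (hard limit is $hard)" >&2
fi
```

The new limit applies only to the current shell and the processes started from it, so the Perl script would have to be launched from that same shell.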

A Perl solution. It opens 1000 files at a time, so it passes over your input 5 times. Run it with the input file name as an argument.

#!/usr/bin/perl
use warnings;
use strict;

my $inputfile = shift;
open my $input, '<', $inputfile or die $!;

mkdir 'cols';

my @headers = split /,/, <$input>;
chomp $headers[-1];
my $pos = tell $input;  # Remember where the first data line starts.

my $step = 1000;
for (my $from = 0; $from <= $#headers; $from += $step) {
    my $to = $from + $step - 1;
    $to = $#headers if $#headers < $to;
    warn "$from .. $to";

    # Open the files and print the headers in range.    
    my @fhs;
    for ($from .. $to) {
        open $fhs[ $_ - $from ], '>', "cols/col-$_" or die $!;
        print { $fhs[ $_ - $from ] } $headers[$_], "\n";
    }

    # Print the columns in range.
    while (<$input>) {
        chomp;
        my $i = 0;
        print { $fhs[$i++] } $_, "\n" for (split /,/)[ $from .. $to ];
    }
    close $_ for @fhs;
    seek $input, $pos, 0;  # Go back to the first data line.
}
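
The number of passes follows from the column count and the batch size by ceiling division. A small sketch of that arithmetic, where `passes_needed` is a hypothetical helper name and the step values other than 1000 are only illustrative:

```shell
#!/usr/bin/env bash
# passes_needed COLUMNS STEP
# Prints how many passes over the input are needed when at most
# STEP column files are open at once (integer ceiling division).
passes_needed() {
    local columns=$1 step=$2
    echo $(( (columns + step - 1) / step ))
}

columns=5000   # column count from the question
for step in 1000 2500 5000; do
    echo "step=$step -> $(passes_needed "$columns" "$step") passes"
done
```

A larger step means fewer passes over the 20 GB input, but it needs a correspondingly higher limit on open files.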

In awk:

$ awk -F, '{for(i=1;i<=NF;i++) print $i > i}' train.csv
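
For comma-separated input the field separator has to be a comma, and the header row probably should not end up in the data files. A minimal sketch of that variant, assuming a small made-up sample file in place of `train.csv` and the `cols/col_N.txt` naming from the question; it does not handle quoted fields that contain commas:

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-column sample standing in for train.csv.
printf 'a,b,c\n1,2,3\n4,5,6\n' > sample.csv
mkdir -p cols

# -F, splits each line on commas; NR > 1 skips the header line.
# Field i of every data row is written to cols/col_<i>.txt in one pass.
awk -F, 'NR > 1 {
    for (i = 1; i <= NF; i++)
        print $i > ("cols/col_" i ".txt")
}' sample.csv

cat cols/col_2.txt
```

With thousands of columns this depends on the awk implementation and the open-files limit allowing that many output files at once.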

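How many columns?

~5000 columns and 2,500,000 rows.

Can't you load it as a table into a database and process it from there? Hm, 5000 columns, probably not. I think the OS will limit how fast this runs based on how many files may be open at one time. Writing each of 5000 columns means 5000 files, so I would be surprised if any OS (without a custom build) supports that. Look at the output of ulimit -a | grep files. What do you see for nofile? That is the number of files one process can have open. So you may have to work around that, and close file[1] when you open file[n+1]. Good luck.

What do you want to do with an embedded comma, like `1,"b,b","ccc"`?

@JamesBrown: Good point, I need to update my environment! I clearly remember having 20 open files to work with on Sun OS (back in the day ;-). Good luck to all!

You can't have 5000 files open at the same time.

Wow, I didn't notice that requirement.

Why can't you? I'm on vanilla Debian, and if I cat /proc//limits | grep "open files" I get 65536.

@JamesBrown: OK, then just increase $step in my solution :-)
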
$ cat > foo
1
2
3
$ awk 'BEGIN {for(i=1;i<=5000;i++) a=a i (i<5000? OFS:"")} {$0=a; for(i=1;i<=NF; i++) print $i > i}' foo
$ ls -l | wc -l
5002 # = 1-5000 + foo and "total 20004"
$ cat 5000
5000
5000
5000

real    1m4.691s
user    1m4.456s
sys     0m0.180s