How do I combine CSV rows based on a duplicate field using Perl Text::CSV?

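I want to write a Perl script that will:

1. Periodically monitor a directory for input CSV files.
2. When a file is detected, open it, read it, and merge the rows that have the same value in the second field/column.
3. Write the updated CSV file to a new directory.
4. Finally, delete the input file.

For example, I have a CSV file containing the following information: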

"101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
scheduled for pickup between, 1:00 PM, and 1:30 PM"

"102","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
scheduled for pickup between, 9:00 PM, and 9:30 PM"
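
I'd like the script to read, parse, and detect the duplicate value in the second field (5555555555), then create a new CSV file in which the records above are merged into a single record, like this: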

"101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
scheduled for pickup between, 1:00 PM, and 1:30 PM AND your trip
tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
scheduled for pickup between, 9:00 PM, and 9:30 PM"

My current Perl code successfully detects, reads, and parses the file, but I don't know how to detect the duplicates and merge the rows.

#!
use strict;
use warnings;
use File::Find;
use Text::CSV;

$| = 1;

use constant {
    #Check for CSV files only
    SUFFIX_LIST => qr/\.(csv)$/,
    DIR_TO_CHECK => "/Users/Me/Desktop/INBOUND/",
};

my @file_list;

while (1) {

    #Recursively search the input directory for CSV files
    find ( sub {
            return unless -f;
            return unless $_ =~ SUFFIX_LIST;

                #Make sure all of the files in the file list array are unique
                if(!(grep(/^$_$/, @file_list))) {
                    push @file_list, $File::Find::name;
                }
           }, DIR_TO_CHECK 
    );

#If .csv files are found...
if (scalar(@file_list) > 0) {
    print "\nNew Item in Directory\n";

    parseFile($file_list[0]);

    #Delete input file
    unlink $file_list[0];

    print "Deleted File\n";

    #Remove the file from the file list
    shift @file_list;
} else {

    print "No New Item\n";

}

sleep 5;
}

#Subroutine to parse and compare the csv file
sub parseFile() {

my $csv = Text::CSV->new({ sep_char     => ',',
                       always_quote => 1,
                       quote_char   => '"',
                       escape_char  => '"',
                       binary       => 1,
                       auto_diag    => 1});

#Get the file that was passed to the function
my $file = $_[0] or die "CSV file not passed in subroutine\n";

#Open file for reading
open(my $data, '<', $file) or die "Could not open '$file' $!\n";

while (my $line = <$data>) {

    print $line;

    if ($csv->parse($line)) {

        my @fields = $csv->fields();

    } else {

        #warn "Line could not be parsed: $line\n";
        Text::CSV->error_input();
    }
}

close $data;
}

I think the approach I have is wrong, because I suspect I need to read the whole file into memory rather than line by line. Please help, thanks.

I haven't used Perl much lately, but here is my answer. Create a hash table keyed on the second field, like this:

$hashtbl{555555} = {
                    id => 102,                         # first field 
                    names => "doe, john",              # third field
                    msg => "DOE, JOHN, your trip..."   # last field 
                    };
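
If the key already exists in the hash table, append the new row's msg to the stored one.

After reading the whole file, use this hash table to create a new CSV file.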
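
The hash-table idea above can be sketched roughly as follows. This is only an illustration under some assumptions: the sample rows are shortened and held in an in-memory string rather than a file, the @order array is an addition to keep output in first-seen order, and the id and name of the first row in each group are the ones kept. Records are read with getline rather than by splitting on newlines, so quoted fields that contain line breaks stay in one record.

```perl
use strict;
use warnings;
use Text::CSV;

# Shortened sample input held in memory; a real script would open a file.
my $input = <<'CSV';
"101","5555555555","DOE, JOHN","first trip"
"102","5555555555","DOE, JOHN","second trip"
"103","4444444444","ROE, JANE","only trip"
CSV

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, always_quote => 1 });

open my $in, '<', \$input or die "Cannot read sample data: $!\n";

my %hashtbl;    # second field => merged record
my @order;      # keys in first-seen order (an assumption, for stable output)

while ( my $row = $csv->getline($in) ) {
    my ( $id, $phone, $name, $msg ) = @$row;
    if ( exists $hashtbl{$phone} ) {
        # Key already seen: append this row's msg to the stored one.
        $hashtbl{$phone}{msg} .= " AND $msg";
    }
    else {
        $hashtbl{$phone} = { id => $id, names => $name, msg => $msg };
        push @order, $phone;
    }
}
close $in;

# Build the new CSV from the hash table, one line per key.
my $output = '';
open my $out, '>', \$output or die "Cannot write to buffer: $!\n";
$csv->eol("\n");    # print() adds no line ending unless eol is set
for my $phone (@order) {
    my $rec = $hashtbl{$phone};
    $csv->print( $out, [ $rec->{id}, $phone, $rec->{names}, $rec->{msg} ] );
}
close $out;

print $output;
```

Swapping the two in-memory handles for real file handles should be the only change needed to run this against files on disk.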

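Something like this should work.

It isn't perfect, but it should give you a big head start. For example, you'll need to add some cruft to strip the extra name out of the merged description column.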

use strict;
use warnings;
use Text::CSV;

# Input and output paths (assumed here to come from the command line)
my ( $path, $newpath ) = @ARGV;

my $data = parseFile($path);
$_ = flatten_record($_) for @$data;    # replace each group with one merged row
writeFile($newpath, $data);


# Column names, in file order
sub csv_cols { qw/ id phone name desc / }

sub get_csv {
    my $csv = Text::CSV->new({
        sep_char     => ',',
        always_quote => 1,
        quote_char   => '"',
        escape_char  => '"',
        binary       => 1,
        auto_diag    => 1
    });
    return $csv;
}


#Subroutine to parse csv file
sub parseFile {
    my ($file) = @_;
    die "CSV file not passed in subroutine\n"
         unless $file;

    my $csv = get_csv();

    #Open file for reading
    open(my $fh, '<', $file)
         or die "Could not open '$file' $!\n";

    $csv->column_names( csv_cols() );

    # make hash of arrays containing the rows that share a phone number
    my %by_phone;
    for my $row ( @{$csv->getline_hr_all($fh)} ) {
        my $phone = $row->{phone};
        $by_phone{$phone} = [] unless $by_phone{$phone};
        push @{$by_phone{$phone}}, $row;
    }
    close $fh;

    # hash order is random, so order the groups by the id of their first row
    # (assumes numeric ids)
    return [ sort { $a->[0]{id} <=> $b->[0]{id} } values %by_phone ];
}


sub flatten_record {
    my ($record) = @_;

    die "Empty record." if @$record == 0;

    if ( @$record == 1 ) {
         $record = $record->[0];
    } else {
         # keep id, phone and name of the first row; join every description
         $record = {
             id    => $record->[0]{id},
             phone => $record->[0]{phone},
             name  => $record->[0]{name},
             desc  => join( ' AND ', map { $_->{desc} } @$record ),
         };
    }

    return $record;
}

sub writeFile {
    my ( $path, $data ) = @_;

    open my $fh, ">", $path
        or die "Error opening '$path' for writing- $!\n";

    my $csv = get_csv();
    $csv->eol("\n");    # print() adds no line ending unless eol is set

    for my $record ( @$data ) {
        my @row = @{$record}{ csv_cols() };
        $csv->print( $fh, \@row );
    }

    close $fh or die "Error closing '$path': $!\n";
}
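
As for the caveat about the extra name in the merged description, one possible approach is to strip the leading name from every description after the first before joining them. The helper below is hypothetical (merge_descs is not part of the code above), and it assumes each description starts with the same name that appears in the name column, followed by a comma, as in the sample data from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the desc fields of all rows in one group,
# dropping the leading "NAME, " from every description after the first.
sub merge_descs {
    my (@rows) = @_;
    my ( $first, @rest ) = @rows;

    # Trim padding so " DOE, JOHN" and "DOE, JOHN " are treated alike.
    ( my $name = $first->{name} ) =~ s/^\s+|\s+$//g;

    my @parts = ( $first->{desc} );
    for my $row (@rest) {
        my $desc = $row->{desc};
        $desc =~ s/^\s*\Q$name\E,\s*//;    # remove the repeated name prefix
        push @parts, $desc;
    }
    return join ' AND ', @parts;
}

# Shortened stand-ins for the two sample rows in the question.
my @group = (
    { name => 'DOE, JOHN ', desc => ' DOE, JOHN, your trip at 1:00 PM' },
    { name => 'DOE, JOHN ', desc => ' DOE, JOHN, your trip at 9:00 PM' },
);

my $merged = merge_descs(@group);
print "$merged\n";
```

If a description does not begin with the name, the substitution simply does nothing and the text is appended unchanged.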

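It looks like the first column isn't used for duplicate detection, but what about the third column? Also, do the rows need to be merged in any particular order?

@ThisSuitIsBlackNot The third column isn't used for duplicate detection either. Ideally, the rows would be merged in the order given by the first column. Thanks.

If, for some reason, you had a row 1,42,jack,foo followed by 2,42,jill,bar, would the merged result have jack or jill in the third column?

@ThisSuitIsBlackNot Good question... At least for now, I'd go with jack. So the updated row would be 1,42,jack,foo and bar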
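
The rule settled on in these comments (merge in first-column order, keep the name from the first row) could look something like the sketch below. It works on plain array refs instead of parsed CSV to stay short, assumes the first column is numeric, and adds a third row with a different key purely for illustration.

```perl
use strict;
use warnings;

# Rows as [ id, key, name, text ], deliberately out of order.
my @rows = (
    [ 2, 42, 'jill', 'bar' ],
    [ 1, 42, 'jack', 'foo' ],
    [ 3, 7,  'anne', 'baz' ],    # extra row, not from the comments
);

my ( %merged, @order );
for my $row ( sort { $a->[0] <=> $b->[0] } @rows ) {
    my ( $id, $key, $name, $text ) = @$row;
    if ( my $seen = $merged{$key} ) {
        # Later duplicate: keep the id and name of the first row,
        # append only the text.
        $seen->[3] .= " and $text";
    }
    else {
        $merged{$key} = [ $id, $key, $name, $text ];
        push @order, $key;
    }
}

my @result = map { join ',', @{ $merged{$_} } } @order;
print "$_\n" for @result;
```

Sorting once up front means the first row seen for each key is always the one with the lowest id, so no extra bookkeeping is needed to decide whose name wins.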