Perl: matching rows across multiple CSV files and merging a specific field
I have about 20 CSV files that all look like this:
"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"
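I've been told that what I need to produce is exactly the same set of files, except that each file must now also contain the start_code values from every other file in which the email matches.
Mismatches in any other field don't matter; only the email field counts. The only change to each file is the addition of any extra start_code values from the other files where the email matches.
For example, if the same email appeared in wicq.csv, oota.csv and itos.csv, it would go from being the following in each file: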
"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"
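to a single row whose start_code field lists all three values (WIQC PDX, OOTA, ITOS)
in all three files (wicq.csv, oota.csv and itos.csv).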
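The tools I have available are the OS X command line (awk, sed, etc.) and perl. I'm not very familiar with any of them, and there may well be a better way to do this.
I would go about it like this: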
cut -d ',' -f1,16 *.csv |
sort |
awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done
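This builds a list of every email and its start code (cut/sort) and merges the codes per email (awk). It then loops over the merged list (while) and replaces the start code on each matching email's row in every file (sed).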
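To make the merge step easier to follow, here is a minimal sketch that runs only the cut/sort/awk part of the pipeline on a few made-up rows and prints the intermediate list. The temporary directory, the row helper, the pad variable and the other@example.com address are assumptions for illustration; the awk program is the one shown above, with a final sort added only to make the output order reproducible.

```shell
#!/bin/sh
export LC_ALL=C   # assumption: byte-wise sort order, so the output is reproducible

tmp=$(mktemp -d)
pad=$(printf ',%.0s' 1 2 3 4 5 6 7 8 9 10 11 12)   # 12 commas: leaves fields 3-13 empty

# row EMAIL CODE: hypothetical helper that prints one 16-field record
row() { printf '"%s","anon"%s01/16/08 08:05 PM,,"%s"\n' "$1" "$pad" "$2"; }

row anon@yahoo.com 'WIQC PDX' > "$tmp/wicq.csv"
{ row anon@yahoo.com OOTA; row other@example.com OOTA; } > "$tmp/oota.csv"
row anon@yahoo.com ITOS > "$tmp/itos.csv"

# Same cut | sort | awk steps as in the answer, without the sed rewrite.
merged=$(cut -d ',' -f1,16 "$tmp"/*.csv | sort |
  awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2}
           END { for (i in array) print i "," array[i] }' | sort)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With these sample rows the list has one line per email, with the anon@yahoo.com line carrying "ITOS","OOTA","WIQC PDX". Each code keeps its own pair of quotes at this stage, and cut splits on every comma, so the sketch only holds when no field before start_code contains a comma.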
But I feel there must be a more efficient way.
use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i = 0;
my $MERGE_i = 1;

# Process all files, creating a set of key-value pairs:
#   $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;
    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;
    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };
        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}
Here is a simple Perl program that does what you need. It makes a single pass over your input, relying on the fact that the input is already sorted.
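As long as the email does not change, it keeps reading lines and appending their codes. When the email changes, it prints the accumulated record (after fixing up the extra double quotes in the code field).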
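So will those modifications (WIQC, PDX, OOTA, ITOS) be squeezed into all three csv files?
@Anders, yes. (Though WICQ-PDX is one modification, not the two your comment mentions.)
I renamed all the files to start with a lowercase character, because any file starting with an uppercase character produced this error: sed: 1: "R2R.csv": invalid command code R. Now I get this error: sed: 1: "bwtl.csv": undefined label 'wtl.csv'. I think it has the same underlying cause, namely that sed is treating the file name as a command.
@alex: Please double-check that you haven't dropped the space before the asterisk and that there are no stray quotes. Are you on a GNU-based (e.g. Linux) system? Do your files have slashes in the data? You could try changing the delimiter in the sed command to a pipe ('s|old|new|') or some other character that does not occur in your data.
This worked great! I had to add 'binmode $fh_in, ":utf8";' and manually strip some blank lines from each file (:g/^$/d), but it did the job, thanks.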
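The two sed errors quoted above are consistent with BSD sed, the OS X default, reading the argument after -i as a backup suffix, so that the first file name ends up being parsed as the script ("R" is not a sed command, and "b" is a branch to a label). The following is a minimal sketch under that assumption, not a tested fix for the original files: it uses the attached-suffix form -i.bak, which both GNU and BSD sed accept, the pipe delimiter suggested in the comment, and a second expression that drops blank lines. The demo file and its contents are made up.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical three-line file: one row to rewrite, one blank line, one row to keep.
printf '%s\n' '"a@example.com","x","OLD/1"' '' '"b@example.com","y","KEEP"' > "$tmp/demo.csv"

# -i.bak : suffix attached to -i, a form both GNU sed and BSD (OS X) sed accept
# s|a|b| : pipe as delimiter, so a slash in the data needs no escaping
# /^$/d  : delete empty lines
sed -i.bak -e '/^"a@example.com",/ s|,[^,]*$|,"NEW/2"|' -e '/^$/d' "$tmp/demo.csv"

result=$(cat "$tmp/demo.csv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The original of each edited file is kept next to it with a .bak suffix, which is worth having when rewriting twenty files in place.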
#!/usr/bin/perl -l
use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
    # Remove repeated double quotes introduced when we appended the code
    $current_record[15] =~ s/""/, /g;
    print join ",", @current_record;
    @current_record = ();
}

while (my $input_line = <>) {
    chomp $input_line;
    @fields = split ",", $input_line;
    # Print a record when the email we read changes. Avoid printing on the first
    # loop by checking we have read at least one email ($last_email is defined).
    defined $last_email && ($fields[0] ne $last_email) && print_record;
    if (!@current_record) {
        # We are starting to process a new email. Grab all fields.
        @current_record = @fields;
    }
    else {
        # We have consecutive records with the same email. Append the code.
        $current_record[15] .= $fields[15];
    }
    # Remember the last processed email. When it changes we will print @current_record.
    $last_email = $fields[0];
}

# Print the last record
print_record;
sort *.csv | ./script.pl
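For comparison, the same single-pass idea (group consecutive rows of sorted input by email and append each new code to the record being built) can be sketched in awk. This is an illustrative alternative rather than the program above: the three-field sample rows and the example.com addresses are made up, the start code is assumed to be the last field, and, like the Perl version, it splits on every comma and so assumes no quoted field contains one.

```shell
#!/bin/sh
# Hypothetical, already-sorted input with the start code as the last field.
grouped=$(printf '%s\n' \
  '"a@example.com","A","ITOS"' \
  '"a@example.com","A","OOTA"' \
  '"b@example.com","B","WIQC PDX"' |
awk -F, '
  # Email changed: emit the finished record and start a new one.
  $1 != last { if (NR > 1) print rec; rec = $0; last = $1; next }
  # Same email: drop the closing quote of the record and the opening
  # quote of the new code, then join them with ", ".
  { code = $NF; sub(/^"/, "", code); sub(/"$/, "", rec); rec = rec ", " code }
  # Flush the last record.
  END { if (NR > 0) print rec }
')
printf '%s\n' "$grouped"
```

As with the Perl program, the result is one merged stream on standard output rather than one rewritten file per input.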