Linux Awk/Perl: converting a text file to a sensibly formatted CSV
I have a historical, auto-generated log file in the format below, and I would like to convert it to a CSV file before uploading it to a database.
--------------------------------------
Thu Jul 8 09:34:12 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 312 cps/MBq
--------------------------------------
Thu Jul 8 09:34:55 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 318 cps/MBq
--------------------------------------
Thu Jul 8 10:13:39 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 307 cps/MBq
--------------------------------------
Thu Jul 8 10:14:10 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 305 cps/MBq
--------------------------------------
Mon Jul 19 10:11:18 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 326 cps/MBq
--------------------------------------
Mon Jul 19 10:12:09 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 333 cps/MBq
--------------------------------------
Mon Jul 19 10:13:57 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 338 cps/MBq
--------------------------------------
Mon Jul 19 10:14:45 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 340 cps/MBq
--------------------------------------
I would like to convert the log file to the following format:
Date,Camera,Head,Duration,Activity
08/07/10,BLUE,1,20,14.9
08/07/10,BLUE,1,20,14.9
08/07/10,RED,1,20,14.9
08/07/10,RED,1,20,14.9
I got close to what I want using awk:
awk 'BEGIN {print "Date,Camera,Head,Duration,Activity";RS = "--------------------------------------"; FS="\n";}; {OFS=",";split($3, a, " ");split($4,b, " "); split($5,c," ");print $2,a[1],a[3],b[3],c[3]}' sensitivity.txt > sensitivity.csv
This gives me:
Date,Camera,Head,Duration,Activity
,,,,
Thu Jul 8 09:34:12 BST 2010,BLUE,1,20,14.9
Thu Jul 8 09:34:55 BST 2010,BLUE,1,20,14.9
Thu Jul 8 10:13:39 BST 2010,RED,1,20,14.9
Thu Jul 8 10:14:10 BST 2010,RED,1,20,14.9
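How can I:
a) get rid of the line that contains only the 4 output field separators (",,,,"), and
b) convert the date format from "Thu Jul 8 09:34:12 BST 2010" to DD/MM/YY?
Can I do this in pure awk, or by piping the output through perl?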
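This straightforward awk script will do the job:
BEGIN {
    # Map month abbreviations to month numbers (Jan -> 1, ..., Dec -> 12)
    n = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", month, "|")
    for (i = 1; i <= n; i++) {
        month_index[month[i]] = i
    }
    print "Date,Camera,Head,Duration,Activity"
}
# A separator line (only dashes; this also matches an empty line)
# starts a new record: reset the line counter
/^-*$/ {
    i = 0
    next
}
# Count the lines within the current record
{
    i++
}
# Line 1: "Thu Jul 8 09:34:12 BST 2010" -> DD/MM/YY
i == 1 {
    printf "%02d/%02d/%02d,", $3, month_index[$2], substr($6, 3)
}
# Line 2: "BLUE Head 1" -> camera, head
i == 2 {
    printf "%s,%d,", $1, $3
}
# Line 3: "Duration = 20 s" -> duration
i == 3 {
    printf "%d,", $3
}
# Line 4: "Activity = 14.9 MBq" -> activity; this ends the CSV row
i == 4 {
    printf "%.1f\n", $3
}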
@sudo_O's answer is good, but here is an alternative:
$ cat tst.awk
BEGIN{ RS="---+\n"; OFS=","; months="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR==1{ print "Date","Camera","Head","Duration","Activity"; next }
{ print sprintf("%04d%02d%02d",$6,(match(months,$2)+2)/3,$3),$7,$9,$12,$16 }
$ gawk -f tst.awk file
Date,Camera,Head,Duration,Activity
20100708,BLUE,1,20,14.9
20100708,BLUE,1,20,14.9
20100708,RED,1,20,14.9
20100708,RED,1,20,14.9
20100719,BLUE,1,20,12.4
20100719,BLUE,1,20,12.4
20100719,RED,1,20,12.4
20100719,RED,1,20,12.4
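Note that I used GNU awk above so that RS can be set to more than one character. With other awks, just convert all of the "--…" lines to blank lines, a control character, or something similar, and set RS accordingly before running the script.
If you don't like the date format I suggested, adjust the sprintf to suit.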
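For awks where RS cannot be a regular expression, the workaround described in the note can be sketched roughly as follows. This is only an illustration under a few assumptions: sed is available, the function name log_to_csv_portable is made up for this example, and index() is swapped in for match() since the month name is a plain string. The header is printed in BEGIN because paragraph mode (RS="") ignores leading blank lines, so there is no empty first record to attach it to.

```shell
# Hypothetical helper: reads the log from the named files, or from stdin.
log_to_csv_portable() {
    # Turn every dashed separator line into a blank line, then let awk read
    # blank-line-separated records (RS="" is POSIX "paragraph mode").
    sed 's/^--*$//' "$@" |
    awk '
        BEGIN {
            RS = ""; OFS = ","
            months = "JanFebMarAprMayJunJulAugSepOctNovDec"
            print "Date", "Camera", "Head", "Duration", "Activity"
        }
        {
            # Fields are split on any whitespace, newlines included:
            # $2 = month, $3 = day, $6 = year, $7 = camera, $9 = head,
            # $12 = duration, $16 = activity
            date = sprintf("%04d%02d%02d", $6, (index(months, $2) + 2) / 3, $3)
            print date, $7, $9, $12, $16
        }'
}
```

It could then be run as: log_to_csv_portable sensitivity.txt > sensitivity.csv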
I thought I would show how to actually parse the input, rather than just performing string conversions:
#! /usr/bin/env perl
use strict;
use warnings;
use Date::Parse;
use Date::Format;
use Text::CSV;

sub convert_date{
  my $time = str2time($_[0]);
  # iso 8601 style:
  return time2str('%Y-%m-%d',$time); # YYYY-MM-DD
  # or, for the outdated style output you wanted, use this line instead:
  # return time2str('%d/%m/%y',$time); # DD/MM/YY
}

my %multiply_table = (
  s => 1,
  m => 60,
  h => 60 * 60,
  d => 60 * 60 * 24,
);

sub convert_duration{
  my($d,$s) = $_[0] =~ /^ \s* (\d+) \s* (\w) \s* $/x;
  # reject unparsable input and unknown units (a duration of 0 is valid)
  die "Invalid duration '$_[0]'" unless defined $d && exists $multiply_table{$s};
  return $d * $multiply_table{$s};
}

my @field_list = qw'Date Camera Head Duration Activity';
my $csv = Text::CSV->new( { eol => "\n" } );

# print header
$csv->print( \*STDOUT, \@field_list );

# set record separator
local $/ = ('-' x 38) . "\n";

# parse data
while(<>){
  chomp; # remove record separator
  next unless $_; # skip empty section
  my($time,$camdat,@fields) = split m/\n/; # split up the fields
  my %data;
  # split camera and head fields
  @data{qw(Camera Head)} = split /\s+Head\s+/, $camdat;
  # parse lines like:
  #   Duration = 20 s
  #   Activity = 14.9 MBq
  #   Sensitivity = 305 cps/MBq
  for(@fields){
    my($key,$value) = /(\w+) \s* = \s* (.*) /x;
    $data{$key} = $value;
  }
  # at this point we start reducing precision
  $data{Date} = convert_date( $time );
  # remove measurement units
  $data{Duration} = convert_duration($data{Duration}); # safe
  $data{Activity} =~ s/[^\d]*$//; # unsafe
  $csv->print(\*STDOUT, [@data{@field_list}]);
}
Comments:
For the date conversion, see the first answer here. For the useless commas, just check the values of $2, a, b, c, etc.: if ($2) {print …}
I don't think I have seen anyone ask to convert a 4-digit year in a date to a 2-digit year since 2000. Seriously consider using a YYYYMMDD date format, so that you can tell 1999 from 2099 and trivially sort your data by date.
Your month array will contain 24 entries instead of 12. As long as the OP doesn't want to print all of the months, it will work. Consider using 2 arrays, for monthNr2nm and monthNm2nr.
@EdMorton, I know, I was just being lazy. I don't think the OP wants to iterate over the array, but I have changed it just in case.
@sudo_O - agreed, but other people looking for answers to similar questions might see it. Mostly I think it helps clarity if you have two separate arrays, even if the one used in the split is named tmp or something.
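The guard suggested in the comments (only print when $2 is non-empty) can be combined with a month lookup to answer both parts of the question while staying close to the asker's original one-liner. The following is a rough sketch, not a definitive version: the function name sensitivity_to_csv and the array names name and month_nr are made up for this example, and it assumes an awk that accepts a multi-character RS, such as gawk or mawk.

```shell
# Hypothetical wrapper: reads the log from the named files, or from stdin.
sensitivity_to_csv() {
    awk '
        BEGIN {
            # A multi-character RS is treated as a regex by gawk and mawk;
            # a strictly POSIX awk may only use its first character.
            RS = "--------------------------------------"
            FS = "\n"; OFS = ","
            # Month abbreviation -> month number, using two separate arrays
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
            for (m = 1; m <= n; m++) month_nr[name[m]] = m
            print "Date,Camera,Head,Duration,Activity"
        }
        # Guard: skip records without a date line, such as the empty
        # record in front of the first separator
        $2 != "" {
            split($2, d, " ")   # d[2] = month, d[3] = day, d[6] = year
            split($3, a, " ")   # a[1] = camera, a[3] = head
            split($4, b, " ")   # b[3] = duration
            split($5, c, " ")   # c[3] = activity
            date = sprintf("%02d/%02d/%02d", d[3], month_nr[d[2]], substr(d[6], 3))
            print date, a[1], a[3], b[3], c[3]
        }
    ' "$@"
}
```

It could then be run as: sensitivity_to_csv sensitivity.txt > sensitivity.csv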