Parsing a CSV file with line breaks and extra commas
My current CSV file looks like this:
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
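In some cases field5 has no value. In addition, field6 is enclosed in quotes and contains line breaks. For example, field6 of the first data row is actually
"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
I need to write a script that parses this file and returns 12.111, 43.555 in place of the current value of field6, so that the final CSV file looks like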
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, , "12.111, 43.555"
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"
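I have looked at csvparser, but my understanding is that it only works when each data row sits on a single line with no line breaks. I also can't simply split the rows on commas, because some addresses contain several commas. Any suggestions on how to parse this CSV file?

You can't. Since commas are allowed in field 6, this is a valid file:
A, B, C, D, E
a, b, c, d, e, f
You cannot tell whether this file contains one entry or two, because field 6 of the first data set could be "E" or "E\n a, b, c, d, e, f".

You can use the csv library for this: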
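import csv
# newline='' leaves line breaks inside quoted fields for the csv module to handle
with open('myfile.csv', newline='') as myfile:
    # skipinitialspace=True lets a quote that follows ", " open a quoted field
    csv_file = csv.reader(myfile, delimiter=',', skipinitialspace=True)
    rows = list(csv_file)  # read every row before the file is closed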
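As a minimal sketch of what the reader yields for the sample data, the snippet below parses the question's sample from a string instead of a file. The names SAMPLE and read_rows are illustrative and not part of the answer. It assumes skipinitialspace=True is wanted, because some opening quotes in the sample follow a space, and the csv module would otherwise not treat them as the start of a quoted field.

```python
import csv
import io

# Sample copied from the question; note the space before some opening quotes.
SAMPLE = '''field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
'''


def read_rows(text):
    """Parse CSV text, keeping quoted multi-line fields in one piece."""
    # newline='' stops StringIO from translating line endings, so the csv
    # module sees the line breaks inside quoted fields unchanged.
    buffer = io.StringIO(text, newline='')
    # skipinitialspace=True drops the blank after each delimiter.
    return list(csv.reader(buffer, skipinitialspace=True))


rows = read_rows(SAMPLE)
for row in rows[1:]:
    # Each record has six fields; the last one still contains line breaks.
    print(len(row), repr(row[5]))
```

Each logical record comes back as one list of six strings, however many physical lines it spans, and an empty field5 comes back as an empty string.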
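Now you have the rows and can do whatever you like with them.

You need a CSV parser to handle this kind of data. I would suggest Perl and Text::CSV. Something like this: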
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } );
open ( my $input_fh, '<', "sample.csv" ) or die $!;
my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );
while ( my $row = $csv -> getline ( $input_fh ) ) {
    $row -> [5] =~ s,.*\(,\(,ms;
    $csv -> print ( \*STDOUT, $row );
}
Output:
field1," field1"," field3"," field4"," field5"," field6 "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"
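The output above still carries the full-precision pair in parentheses. The remaining step might look like the following sketch, written in Python rather than Perl. The names COORDS, truncate and coords_only are made up for illustration. It assumes the coordinates should be cut, not rounded, to three decimals, as the expected output in the question suggests, and it normalises a detached sign such as "- 49.5567" to "-49.556".

```python
import re
from decimal import Decimal, ROUND_DOWN

# Matches a trailing "(lat, lon)" pair; the sign may be separated from the
# digits by a space, as in "(67.11111,- 49.5567)".
COORDS = re.compile(
    r'\(\s*(-?\s*\d+(?:\.\d+)?)\s*,\s*(-?\s*\d+(?:\.\d+)?)\s*\)\s*$'
)


def truncate(number_text, places=3):
    """Cut a decimal string to `places` fractional digits without rounding."""
    sign = '-' if number_text.startswith('-') else ''
    digits = number_text.lstrip('-').strip()
    quantum = Decimal(1).scaleb(-places)  # 0.001 when places == 3
    return sign + str(Decimal(digits).quantize(quantum, rounding=ROUND_DOWN))


def coords_only(field6):
    """Return 'lat, lon' taken from the end of field6, or field6 unchanged."""
    match = COORDS.search(field6)
    if match is None:
        return field6
    lat, lon = (truncate(part) for part in match.groups())
    return f'{lat}, {lon}'


print(coords_only('Moore Ave\nBoston, MA 23123\n(67.11111,- 49.5567)'))
```

Decimal is used instead of float so that truncation works on the digits as written and is not affected by binary rounding.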
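Hopefully this gives you a clear idea of how to modify field6 further so that it meets your specification exactly.

For this kind of "unstructured CSV" format you can use Marpa::R2, a Perl interface to a general BNF parser. The data can be described in its grammar notation as
this ::= that
rules (the ~ operator defines lexical rules). Parentheses in a rule, e.g. (header [\n]), mean "do not include this in the parse result".
The parser returns a data structure (an array of arrays in the form [id, child1, child2, ...]) from which the data can be extracted.
You can also define actions as Perl subs, in the same package or a separate one, to process the data.
Below is a sample script based on your data, together with its output.
Script: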
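use 5.010;
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;
use Marpa::R2;

my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
    # The rule default is inferred from the [id, child1, child2, ...] output.
    :default ::= action => [ name, value ]
    lexeme default = action => [ name, value ] latm => 1

    csv ::= (header [\n]) lines
    header ::= column+ separator => column_sep
    column_sep ~ ','
    column ~ 'field' [1-6]

    lines ::= line+ separator => [\n]
    line ::= fields1_5 (',') field6
    field_sep ~ ','
    fields1_5 ::= field1_5+ separator => field_sep
    field1_5 ~ num | word | code
    field6 ~ address

    num ~ [\d]+
    word ~ [A-Za-z]+
    code ~ num word '-' word
    address ~ '"' address_chars '"'
    address_chars ~ [^\"]+ # "

    :discard ~ whitespace
    whitespace ~ ' '
END_OF_SOURCE
} );

# The input is the sample data from the question.
my $input = <<'END_OF_INPUT';
field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
END_OF_INPUT

# The option name trace_terminals is an assumption; only its value 0 is certain.
my $ast = ${ $g->parse( \$input, { trace_terminals => 0 } ) };
print Dumper($ast);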
Output:
[
'csv',
[
'lines',
[
'line',
[
'fields1_5',
[
'field1_5',
'111'
],
[
'field1_5',
'John'
],
[
'field1_5',
'Doctor'
],
[
'field1_5',
'1A-jrd'
]
],
[
'field6',
'"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
]
],
[
'line',
[
'fields1_5',
[
'field1_5',
'121'
],
[
'field1_5',
'Bob'
],
[
'field1_5',
'Teacher'
],
[
'field1_5',
'2A-abcd'
],
[
'field1_5',
'345'
]
],
[
'field6',
'"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
]
],
[
'line',
[
'fields1_5',
[
'field1_5',
'131'
],
[
'field1_5',
'Kyle'
],
[
'field1_5',
'Engineer'
],
[
'field1_5',
'3A-bhbh'
]
],
[
'field6',
'"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
]
]
]
]
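To illustrate how rows can be pulled out of an [id, child1, child2, ...] structure like the one above, here is a small Python sketch. It assumes the dumped tree has been transcribed by hand into nested Python lists (only the first two records are shown), and the helpers children and rows_from_tree are hypothetical names, not part of Marpa::R2. Note that an empty field5 does not appear in the tree at all, so the sketch pads short rows.

```python
# Hand-transcribed fragment of the dumped tree, as nested Python lists.
tree = ['csv', ['lines',
    ['line',
        ['fields1_5', ['field1_5', '111'], ['field1_5', 'John'],
                      ['field1_5', 'Doctor'], ['field1_5', '1A-jrd']],
        ['field6', '"Tuft St\nPeoria, IL 54345\n(12.11111, 43.5555)"']],
    ['line',
        ['fields1_5', ['field1_5', '121'], ['field1_5', 'Bob'],
                      ['field1_5', 'Teacher'], ['field1_5', '2A-abcd'],
                      ['field1_5', '345']],
        ['field6', '"Moore Ave\nBoston, MA 23123\n(67.11111,- 49.5567)"']],
]]


def children(node, wanted):
    """Return the child nodes of `node` whose id (first element) is `wanted`."""
    return [c for c in node[1:] if isinstance(c, list) and c[0] == wanted]


def rows_from_tree(root):
    """Flatten the parse tree into one list of six strings per record."""
    rows = []
    for lines in children(root, 'lines'):
        for line in children(lines, 'line'):
            fields = [leaf[1]
                      for group in children(line, 'fields1_5')
                      for leaf in children(group, 'field1_5')]
            # An empty field5 is absent from the tree, so pad to five fields.
            fields += [''] * (5 - len(fields))
            address = children(line, 'field6')[0][1].strip('"')
            rows.append(fields + [address])
    return rows


for row in rows_from_tree(tree):
    print(row)
```

Because every node carries its rule name as the first element, the walk can select children by name instead of relying on their position.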
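What language? In Perl, I would consider using the Text::CSV module.
Are the fields in the CSV file really unquoted? That makes accurate parsing very difficult, if not impossible.
@Sobrique: Any language is fine.
@rici: Only fields that contain multiple commas, such as field6, are quoted.
But field6 has extra commas in it, so I can't simply split on commas.
Then use a different delimiter.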