Parsing a CSV file with line breaks and extra commas


My current CSV file looks like this:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, ,"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5555)"  
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave  
Boston, MA 23123  
(67.11111,- 49.5567)"  
131, Kyle, Engineer, 3A-bhbh, , "Barnes St  
San Francisco, CA 34654  
(65.11111, 55.985432)"  
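In some cases field5 has no value. In addition, field6 is enclosed in quotes and contains line breaks. For example, field6 of the first data row is actually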

"Tuft St  
Peoria, IL 54345  
(12.11111, 43.5575)"  
I need to write a script that parses this file and returns 12.111, 43.557 in place of the current value of field6, so that the final CSV file looks like this:

field1, field1, field3, field4, field5, field6  
111, John, Doctor, 1A-jrd, , "12.111, 43.555"  
121, Bob, Teacher, 2A-abcd, 345, "67.111,- 49.556"  
131, Kyle, Engineer, 3A-bhbh, , "65.111, 55.985"  

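I have looked at cvsparser, but my understanding is that it only works when each data row sits entirely on one line, with no line breaks. I also cannot simply split each line on commas, because some addresses contain several commas. Any suggestions on how to parse this CSV file?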

You can't. Because commas are allowed in field6, this is a valid file:

A, B, C, D, E
a, b, c, d, e, f


You cannot tell whether this file contains one record or two, because field6 of the first record could be "E" or "E\n a, b, c, d, e, f".
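To make the ambiguity concrete, here is a small sketch using Python's standard csv module. The two sample strings are made up for illustration. Without quotes a reader has to treat the line break as the end of a record; once the multi-line field is quoted, the same text reads as a single record.

```python
import csv
import io

# Hypothetical sample: nothing is quoted, so the newline must end the record.
unquoted = "A,B,C,D,E\na,b,c,d,e,f\n"
# Same characters, but the last field is quoted and spans two lines.
quoted = 'A,B,C,D,"E\na,b,c,d,e,f"\n'

# newline="" keeps embedded line breaks untouched for the csv module.
unquoted_rows = list(csv.reader(io.StringIO(unquoted, newline="")))
quoted_rows = list(csv.reader(io.StringIO(quoted, newline="")))

print(len(unquoted_rows))  # two records
print(len(quoted_rows))    # one record
print(repr(quoted_rows[0][4]))
```

So the quotes in the question's data are what make a reliable parse possible.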

You can use the csv library for this:

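import csv

# newline='' leaves line breaks inside quoted fields to the csv module;
# skipinitialspace=True lets a quote that follows ", " open a quoted field
with open('myfile.csv', newline='') as myfile:
    csv_file = csv.reader(myfile, delimiter=',', skipinitialspace=True)
    rows = list(csv_file)  # one list of field strings per record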

Now you have the rows and can do whatever you like with them.
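As a rough check of this approach against the data in the question, the sketch below feeds the sample through csv.reader from an in-memory string. The helper name read_rows is made up, and the trailing spaces at the ends of the sample lines are left out as an assumption about the real file.

```python
import csv
import io

# Sample data copied from the question (trailing spaces omitted).
SAMPLE = '''field1, field1, field3, field4, field5, field6
111, John, Doctor, 1A-jrd, ,"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"
121, Bob, Teacher, 2A-abcd, 345, "Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"
131, Kyle, Engineer, 3A-bhbh, , "Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"
'''

def read_rows(text):
    """Parse CSV text whose quoted fields may contain commas and newlines."""
    # skipinitialspace=True matters here: in `, "Moore Ave` the quote follows
    # a space, and without the flag it is not treated as an opening quote.
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    return list(reader)

rows = read_rows(SAMPLE)
header, records = rows[0], rows[1:]
print(len(records))
for rec in records:
    print(rec[:5], repr(rec[5]))
```

Each record comes back with six fields; an empty field5 is an empty string, and field6 keeps its embedded line breaks.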

You need a CSV parser to handle this kind of data. I suggest using Perl with Text::CSV.

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV; 

my $csv = Text::CSV -> new( { 'binary' => 1, eol => "\n" } ); 

open ( my $input_fh, '<', "sample.csv" ) or die $!; 

my $header = $csv -> getline ( $input_fh );
$csv -> print ( \*STDOUT, $header );

while ( my $row = $csv -> getline ( $input_fh ) ) { 
    $row -> [5] =~ s,.*\(,\(,ms;
    $csv -> print ( \*STDOUT, $row );
}
Output:

field1," field1"," field3"," field4"," field5"," field6  "
111," John"," Doctor"," 1A-jrd"," ","(12.11111, 43.5555)"
121," Bob"," Teacher"," 2A-abcd",345,"(67.11111,- 49.5567)"
131," Kyle"," Engineer"," 3A-bhbh"," ","(65.11111, 55.985432)"

Hopefully this makes it clear how to go on and modify field6 so that it meets your spec exactly.
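One possible way to take that last step, sketched in Python rather than Perl: pull the two numbers out of the trailing parenthesised pair and cut each to three decimals. The names COORDS, truncate3 and coords_only are made up for this sketch. It assumes truncation rather than rounding, which is what the expected output in the question suggests (49.5567 becomes 49.556), and it keeps the separator between the numbers exactly as written.

```python
import re

# Matches the trailing "(lat, lon)" pair. The separator is captured as-is
# so that oddities such as ",- " in the sample survive unchanged.
COORDS = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)(\s*,[\s-]*)(\d+(?:\.\d+)?)\s*\)\s*$")

def truncate3(number):
    """Cut a decimal string to at most three decimals, without rounding."""
    whole, _, frac = number.partition(".")
    return f"{whole}.{frac[:3]}" if frac else whole

def coords_only(address):
    """Return 'lat<sep>lon' truncated to 3 decimals, or the input if no pair is found."""
    match = COORDS.search(address)
    if match is None:
        return address
    lat, sep, lon = match.groups()
    return f"{truncate3(lat)}{sep}{truncate3(lon)}"

print(coords_only("Tuft St\nPeoria, IL 54345\n(12.11111, 43.5555)"))
print(coords_only("Moore Ave\nBoston, MA 23123\n(67.11111,- 49.5567)"))
```

Working on strings instead of floats avoids any surprises from binary floating-point rounding.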

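For this kind of "unstructured CSV" format, you can use Marpa::R2, a Perl interface to a general BNF parser.

The data can be described with BNF rules of the form this ::= that (the ~ operator defines lexical rules). Parenthesized items in a ::= rule, e.g. (header [\n]), mean "do not include this in the parse result".

The parser returns a data structure (an array of arrays in the form [id, child1, child2, ...]) from which the data can be extracted.

You can also define Perl subs, in the same package or a separate one, to process the data.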

Below is a sample script based on your data, followed by its output.

Script:

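use 5.010;
use strict;
use warnings;

use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Deepcopy = 1;

use Marpa::R2;

my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
    :default ::= action => [ name, value ]
    lexeme default = action => [ name, value ] latm => 1

    csv ::= (header [\n]) lines

    header ::= column+ separator => column_sep
    column_sep ~ ','
    column ~ 'field' [1-6]

    lines ::= line+ separator => [\n]
    line ::= fields1_5 (',') field6

    field_sep ~ ','
    fields1_5 ::= field1_5+ separator => field_sep
    field1_5 ~ num | word | code
    field6 ~ address

    num ~ [\d]+
    word ~ [A-Za-z]+
    code ~ num word '-' word
    address ~ '"' address_chars '"'
    address_chars ~ [^\"]+ #"

    :discard ~ whitespace
    whitespace ~ ' '
END_OF_SOURCE
} );

# my $input = ... (the rest of the script is truncated)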
Output:

[
  'csv',
  [
    'lines',
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '111'
        ],
        [
          'field1_5',
          'John'
        ],
        [
          'field1_5',
          'Doctor'
        ],
        [
          'field1_5',
          '1A-jrd'
        ]
      ],
      [
        'field6',
        '"Tuft St
Peoria, IL 54345
(12.11111, 43.5555)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '121'
        ],
        [
          'field1_5',
          'Bob'
        ],
        [
          'field1_5',
          'Teacher'
        ],
        [
          'field1_5',
          '2A-abcd'
        ],
        [
          'field1_5',
          '345'
        ]
      ],
      [
        'field6',
        '"Moore Ave
Boston, MA 23123
(67.11111,- 49.5567)"'
      ]
    ],
    [
      'line',
      [
        'fields1_5',
        [
          'field1_5',
          '131'
        ],
        [
          'field1_5',
          'Kyle'
        ],
        [
          'field1_5',
          'Engineer'
        ],
        [
          'field1_5',
          '3A-bhbh'
        ]
      ],
      [
        'field6',
        '"Barnes St
San Francisco, CA 34654
(65.11111, 55.985432)"'
      ]
    ]
  ]
]
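
The [id, child1, child2, ...] shape shown above is straightforward to walk in any language. As an illustration only, here is a Python sketch with one 'line' node transcribed by hand into nested lists; the helper name leaves is made up, and nothing here is part of Marpa itself.

```python
# One 'line' node from the output above, transcribed as Python lists.
line = [
    "line",
    ["fields1_5",
        ["field1_5", "111"],
        ["field1_5", "John"],
        ["field1_5", "Doctor"],
        ["field1_5", "1A-jrd"]],
    ["field6", '"Tuft St\nPeoria, IL 54345\n(12.11111, 43.5555)"'],
]

def leaves(node):
    """Yield (id, text) for every string child found in the tree."""
    node_id, *children = node
    for child in children:
        if isinstance(child, str):
            yield node_id, child
        else:
            yield from leaves(child)

pairs = list(leaves(line))
fields = [text for node_id, text in pairs if node_id == "field1_5"]
address = next(text for node_id, text in pairs if node_id == "field6")

print(fields)
# The coordinates sit on the last line of the quoted address.
print(address.strip('"').splitlines()[-1])
```

The same walk applied to every 'line' node under 'lines' would give one list of fields and one address per record.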

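What language? In Perl, I would consider using the Text::CSV module.

Are the fields in the CSV file really unquoted? That makes accurate parsing very difficult (if not impossible).

@Sobrique: Any language is fine.

@rici: Only fields that contain multiple commas, such as field6, are quoted.

Fields that contain multiple commas (such as field6) are quoted. But field6 has extra commas in it, so I can't simply split on commas.

Then use a different delimiter.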