Apache Pig: how to read data from a CSV file

Apache Pig: how to read data from a CSV file in which values may optionally be enclosed in double quotes

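Sample data:

"Traditional",0.03,"Department, of Housing and Urban Development (HUD)",0.01

Expected output:

Traditional  0.03  Department, of Housing and Urban Development (HUD)  0.01

The example above has four columns. Two are enclosed in double quotes; the other two are unquoted and hold floating-point values. In addition, the value in the third column itself contains a comma.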

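Please point me to the relevant Pig APIs (with sample code) that split this data correctly and let me process the fields using positional notation ($0, $1, $2, $3).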


I have explored CSVExcelStorage and CSVLoader from PiggyBank, but I could not get the data to split correctly.

Option 1: use CSVLoader or CSVExcelStorage

 REGISTER piggybank.jar;
 DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

 -- CSVLoader reads a double-quoted field as one value, so the embedded comma is kept
 a = LOAD 'data' USING CSVLoader() AS (field1:chararray, field2:double,
                                       field3:chararray, field4:double);

 b = FOREACH a GENERATE $0,$1,$2,$3;

 DUMP b;
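
Option 1 names CSVExcelStorage but only shows CSVLoader. A minimal sketch of the CSVExcelStorage variant might look like the following. It assumes the same piggybank.jar and the same 'data' path as the snippet above, reuses that snippet's illustrative field names, and passes only the delimiter to the constructor, leaving every other setting at its default.

```pig
-- Assumption: piggybank.jar is reachable from the working directory, as in the snippet above.
REGISTER piggybank.jar;

-- Load with a comma delimiter; quoted fields are read as single values.
a = LOAD 'data'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (field1:chararray, field2:double, field3:chararray, field4:double);

-- Positional references work the same way as with CSVLoader.
b = FOREACH a GENERATE $0, $1, $2, $3;

DUMP b;
```

With either loader, the third column should arrive as a single chararray with its comma intact, so $2 refers to the whole department name.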
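
Option 2: TextLoader + STRSPLIT + REPLACE

 A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

 -- Strip the double quotes and name the result so the next statement can reference it
 B = FOREACH A GENERATE REPLACE(line, '"', '') AS line;

 -- Caution: this splits on every comma, so a field that itself contains a comma
 -- (such as the third column in the sample) is broken into two fields
 C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, ','));

 DUMP C;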

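This would probably be a better answer if it explained what the code it provides does.

-- PigStorage splits on every comma and does not treat double quotes specially
a = LOAD 'filename.csv' USING PigStorage(',') AS (fieldname:chararray, fieldname2:float);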

DUMP a;