Apache Pig: how to read data from a CSV file
How can I read data from a CSV file in Apache Pig when values may optionally be enclosed in double quotes? Sample data:
"Traditional",0.03,"Department, of Housing and Urban Development (HUD)",0.01
Expected output:
Traditional 0.03 Department, of Housing and Urban Development (HUD) 0.01
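The example above has 4 columns. Two are enclosed in double quotes; the other two are unquoted floating-point values. In addition, the data in the third column itself contains a comma.
Please point me to some Pig APIs (with sample code) that split this data correctly and let me process the fields with positional notation such as $0, $1, $2, $3.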
I have already explored CSVExcelStorage and CSVLoader from PiggyBank, but I could not get the data to split correctly.
Option 1: use CSVLoader or CSVExcelStorage
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
-- CSVLoader splits on commas and treats a double-quoted value as one field.
a = load 'data' USING CSVLoader() AS (field1:chararray,field2:double,
field3:chararray,field4:double);
-- Fields can then be addressed by position.
b = FOREACH a GENERATE $0,$1,$2,$3;
DUMP b;
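Option 1 names CSVExcelStorage but only shows CSVLoader. A minimal sketch of the CSVExcelStorage variant follows, assuming piggybank.jar is on the path and the input sits in 'data' as above. The relation names and field names (rows, out, loan_type, rate, agency, share) are hypothetical and chosen only for illustration.

```pig
REGISTER piggybank.jar;

-- Load with CSVExcelStorage, passing the comma explicitly as the delimiter.
-- Field names below are illustrative, not taken from the original question.
rows = LOAD 'data'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
       AS (loan_type:chararray, rate:double, agency:chararray, share:double);

-- Address the four columns by position.
out = FOREACH rows GENERATE $0, $1, $2, $3;
DUMP out;
```

Using the fully qualified class name avoids the separate DEFINE step, at the cost of a longer LOAD statement.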
Option 2: TextLoader + STRSPLIT + REPLACE
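-- Read each record as a single raw line.
A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);
-- Strip every double quote and name the result so the next step can refer to it.
B = FOREACH A GENERATE REPLACE(line,'"','') AS line;
-- Split on commas. Caveat: a comma inside a quoted value is split as well.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, ','));
DUMP C;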
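Splitting on every comma breaks the third column of the sample at its embedded comma. One possible workaround, sketched here under assumptions, is to split first with a regular expression that matches only commas followed by an even number of double quotes, and strip the quotes afterwards. The regular expression, the relation names (raw, parts, clean), and the field names f1 to f4 are not from the original answer, and the pattern assumes quotes are balanced and never escaped inside a value.

```pig
-- Read each record as a single raw line.
raw = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

-- Split only on commas that lie outside double quotes: the lookahead requires
-- an even number of quote characters between the comma and the end of line.
parts = FOREACH raw GENERATE
        FLATTEN(STRSPLIT(line, ',(?=(?:[^"]*"[^"]*")*[^"]*$)'))
        AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);

-- Remove the surrounding quotes and cast the numeric columns.
clean = FOREACH parts GENERATE
        REPLACE(f1, '"', ''), (double)f2, REPLACE(f3, '"', ''), (double)f4;
DUMP clean;
```

For the sample row this should keep "Department, of Housing and Urban Development (HUD)" together as one field. For anything beyond simple input, a dedicated CSV loader is likely the more robust choice.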
Comment: this would probably be a better answer if it explained what the code it provides does.
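-- PigStorage splits on every comma and does not strip quotes, so this fits only CSV without quoted values.
a = LOAD 'filename.csv' USING PigStorage (',') AS (fieldname:chararray, fieldname2:float);
DUMP a;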