Apache Pig CsvExcelStorage


Here is my data:

"2001-08-10 01:09:29","3820166553","<P><TABLE border=0 cellSpacing=0 cellPadding=0 width="100%"><TBODY><TR><TD style="BACKGROUND-COLOR: #eee">
<TABLE border=0 cellSpacing=0 cellPadding=0 align=center><TBODY><TR><TD style="BACKGROUND-COLOR: #fff" class=maincontent><TABLE style="MARGIN: 0px auto" border=0 cellSpacing=0 cellPadding=0 width=580 align=center><TBODY><TR><TD colSpan=2><A href="general/forms/YHMN?ci=0c01026bd7fbc394&amp;ags=D778083&amp;cn=KOBY&amp;pfnn="><IMG border=0 alt="Thanks for contacting us click here button" src="http:/edm/tcp-images/yhan-edm-banner-purple_button.jpg" width=580 height=255></A></TD></TR><TR><TD style="PADDING-BOTTOM: 35px; PADDING-LEFT: 32px; PADDING-RIGHT: 32px; PADDING-TOP: 35px" colSpan=2><P style="TEXT-ALIGN: left; LINE-HEIGHT: 18px; MARGIN-TOP: 0px; FONT-FAMILY: Arial, Helvetica, sans-serif; COLOR: #414141; FONT-SIZE: 14px" align=left></P><P style="TEXT-ALIGN: left; LINE-HEIGHT: 18px; FONT-FAMILY: Arial, Helvetica, sans-serif; COLOR: #414141; FONT-SIZE: 14px" align=left>If...",,"778083"
Script and output:

stage = LOAD '/filename' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','YES_MULTILINE','UNIX','READ_INPUT_HEADER') as (col1,col2,col3,col4,col5);
col3_data=FOREACH stage GENERATE col3;
DUMP col3_data;
(<P><TABLE border=0 cellSpacing=0 cellPadding=0 width=100%"><TBODY><TR><TD style=BACKGROUND-COLOR: #eee">
<TABLE border=0 cellSpacing=0 cellPadding=0 align=center><TBODY><TR><TD style=BACKGROUND-COLOR: #fff" class=maincontent><TABLE style=MARGIN: 0px auto" border=0 cellSpacing=0 cellPadding=0 width=580 align=center><TBODY><TR><TD colSpan=2><A href=general/forms/YHMN?ci=0c01026bd7fbc394&amp;ags=D778083&amp;cn=KOBY&amp;pfnn="><IMG border=0 alt=Thanks for contacting us click here button" src=http:/edm/tcp-images/yhan-edm-banner-purple_button.jpg" width=580 height=255></A></TD></TR><TR><TD style=PADDING-BOTTOM: 35px; PADDING-LEFT: 32px; PADDING-RIGHT: 32px; PADDING-TOP: 35px" colSpan=2><P style=TEXT-ALIGN: left; LINE-HEIGHT: 18px; MARGIN-TOP: 0px; FONT-FAMILY: Arial)
After loading, when I dump the data, I get the following result:

col3 = <P>this is test data,
<TR>
<\TR><BODY>"Text-Align: Arial
So col3 contains incomplete data, and col4 contains the wrong data.


Please help me identify what is wrong here.
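One visible property of the row above is that the double quotes inside the HTML field are not doubled. The following is a minimal sketch of how such quotes can end a quoted field at the next comma. It uses Python's csv module as a stand-in parser, not the parser inside CSVExcelStorage, and the row is a shortened, hypothetical version of the question's data; `parse_first_row` is a helper name introduced only for this illustration.

```python
import csv
import io


def parse_first_row(text):
    """Hypothetical helper: parse CSV text and return its first record."""
    # newline='' lets the csv module see line breaks inside quoted fields.
    return next(csv.reader(io.StringIO(text, newline='')))


# Shortened, hypothetical row modelled on the question's data: the HTML in
# the third field contains double quotes that are NOT doubled.
raw = ('"2001-08-10 01:09:29","3820166553",'
       '"<P style="FONT-FAMILY: Arial, Helvetica">If...",,"778083"\n')

# The same row with each embedded quote doubled, as RFC 4180 describes.
escaped = ('"2001-08-10 01:09:29","3820166553",'
           '"<P style=""FONT-FAMILY: Arial, Helvetica"">If...",,"778083"\n')

broken = parse_first_row(raw)
fixed = parse_first_row(escaped)

print(len(broken), repr(broken[2]))
print(len(fixed), repr(fixed[2]))
```

With this stand-in parser, the unescaped row splits into six fields and the third one stops at `Arial`, much like the truncated col3 in the DUMP output above, while the doubled-quote row yields five fields with the HTML intact. Whether CSVExcelStorage treats stray quotes the same way is an assumption that should be checked against its source.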

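I tried the use case above; the only change was the reference to CSVExcelStorage. It returns the expected value for field 3.

I assume you have defined csvExcelStorage as an alias for this class.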

Pig version: Apache Pig version 0.14.0 (r1640057), compiled Nov 16 2014, 18:01:24.

Input:

"2015-08-17 23:55:59","12345","<P>this is test data,
<TR>
<\TR><BODY>Text-Align: Arial, Roman,feed this; end of input...","column 4"

Pig script:

csvexceldata = LOAD 'csvdata.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','YES_MULTILINE','NOCHANGE','READ_INPUT_HEADER') AS (col1,col2,col3,col4);
col3_data = FOREACH csvexceldata GENERATE col3;
DUMP col3_data;
Output:

(<P>this is test data,
<TR>
<\TR><BODY>Text-Align: Arial, Roman,feed this; end of input...)
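
The answer's result can be cross-checked outside Pig. The following is a minimal sketch that uses Python's csv module as a stand-in parser (it is not the code path CSVExcelStorage uses). It parses the answer's test record and checks that the multi-line third field survives intact; the variable names are introduced only for this illustration.

```python
import csv
import io

# The answer's test record: the third field spans three lines and contains
# commas, but no embedded double quotes.
sample = (
    '"2015-08-17 23:55:59","12345","<P>this is test data,\n'
    '<TR>\n'
    '<\\TR><BODY>Text-Align: Arial, Roman,feed this; end of input...","column 4"\n'
)

# newline='' keeps the line breaks inside the quoted field untouched.
rows = list(csv.reader(io.StringIO(sample, newline='')))
col3 = rows[0][2]

print(len(rows), len(rows[0]))
print(col3)
```

Because the quoted field contains only commas and line breaks, a quote-aware parser reads it as one field, which is consistent with the output the answer reports.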



Which version of Pig are you using? Could you add the DUMP output of the stage alias to the question?