Error in a simple Hadoop Pig script
Here is my entire script. It should read a Project Gutenberg etext, strip off the header and footer text, and leave only the actual text of the book, so that it can be used for further analysis:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;
footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;
blocks = JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';
It fails with the error: org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
If I try to run only a subset of the script, everything up to the last line is fine. Running the first five lines plus the first commented-out STORE line works. Running the next three lines plus the next commented-out STORE line works. If I disable either one of the STORE lines, the script runs fine, so each individual STORE statement is not the problem. Both of them together? ERROR 2017! Any suggestions? I tried two different distributions, one from Hortonworks and one from Cloudera, using clean VM images freshly downloaded from their respective sites.

Given that your goal is to strip the header/footer and save only the book, you don't need to STORE anything besides the book itself. I think your problem is blocks = JOIN headers BY $0, footers BY $0; which performs a self-join on data that was loaded only once. I downloaded War and Peace and this code worked for me:
$ pig -x local
# grunt>
ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();
footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();
/* Now re-load headers and footers for join */
h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';
It should also work fine if you read the original input into two different variables:
ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
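A minimal sketch of how that two-load variant could continue (assuming the same '*** START'/'*** END' marker lines as in the original script); because headers and footers now derive from two separate LOAD statements, the JOIN is no longer a self-join:

```pig
ranked_h = RANK ebook_header;
ranked_f = RANK ebook_footer;
header = FILTER ranked_h BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
footer = FILTER ranked_f BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
flines = FOREACH footer GENERATE $0;
headers = RANK hlines;
footers = RANK flines;
-- joins two relations with independent load lineages
blocks = JOIN headers BY $0, footers BY $0;
```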
and then apply the corresponding filters to each of them. I suppose it depends on the situation which is better: reading the input once, creating two outputs, and reading those back in, or simply reading the input twice.

I was only storing the intermediate data as a test, to see whether it was being produced correctly. If it were never written out at all, the final write would fail anyway. Even when I try to write out just those two relations, with everything after them commented out so nothing else happens, the second write still fails. Either one alone, with the other commented out, works fine. "It should also work"? I don't expect it to also work, because it doesn't work. I expect it to work better, as in: to work! Is there a known issue with writing out multiple datasets derived from the same source?

Sorry, the answer was confusing. What I meant to point out in the earlier answer is that I think your problem is blocks = JOIN headers BY $0, footers BY $0; and the suggested fix is to store the intermediate results. The other option is to read the raw data into two different variables.

My problem was very similar to yours, and both approaches seem to fix it, both on my Ubuntu machine and on AWS. I think the issue is that the join is never written out, but the join is only actually executed once output is needed.