
Hadoop: Error in a simple Pig script


Here is my entire script. It is supposed to take a Project Gutenberg etext, strip out the header and footer text, and leave only the actual text of the book so it can be used for further analysis:

ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;

footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;

blocks =  JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';

book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';
It fails with the error: org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.


If I run only a subset of the script, everything up to the last line is fine. Running the first 5 lines plus the first commented-out STORE line works. Running through the next 3 lines plus the next commented-out STORE line makes it crash. If I disable either one of the STORE lines, it runs fine. So each individual STORE statement is fine on its own, but the two of them together? ERROR 2017! Any suggestions? I have tried two different distributions, one from Hortonworks and one from Cloudera, using clean VM images freshly downloaded from their respective sites.

Given that your goal is to strip the header/footer and save only the book, you don't need to store anything except the book itself, not the headers/footers. I think your problem is `blocks = JOIN headers BY $0, footers BY $0;`, which performs a self-join on data that was only loaded once. I downloaded War and Peace, and this code works for me:

$ pig -x local
# grunt>

ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();

footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();

/* Now re-load headers and footers for join */

h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);

blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';

It should also work fine if you load the raw input into two different relations:

ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);

and apply the corresponding filter to each. I suppose it depends on the situation whether it is better to read the input once, create the two outputs, and read them back in, or to read the input twice.
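As a sanity check outside of Pig, the same end-to-end logic (find the START and END marker lines, keep only the lines strictly between them) can be sketched in plain Python. This is a hypothetical helper, not part of the original script; the marker strings are the ones the question's script matches with SUBSTRING:

```python
def strip_gutenberg_boilerplate(lines):
    """Return only the lines strictly between the Project Gutenberg
    START and END marker lines, mirroring what the Pig pipeline's
    RANK/FILTER/CROSS/FILTER chain computes."""
    start = end = None
    for i, line in enumerate(lines):
        # Same prefixes the Pig script compares via SUBSTRING(line, 0, N)
        if line.startswith('*** START OF THIS PROJECT GUTENBERG EBOOK'):
            start = i
        elif line.startswith('*** END OF THIS PROJECT GUTENBERG EBOOK'):
            end = i
    if start is None or end is None:
        return []  # no complete header/footer pair found
    return lines[start + 1:end]

sample = [
    'Title page junk',
    '*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***',
    'Chapter 1',
    'It was a dark and stormy night.',
    '*** END OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***',
    'License text',
]
book = strip_gutenberg_boilerplate(sample)
```

Running this on a real etext and diffing against the Pig output is a quick way to confirm the STORE'd result is the book body only.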

I was only storing the intermediate data as a test of whether it was working; if it never gets written out, the final write will fail anyway. Even with everything after them commented out, just trying to write out those two datasets and nothing else, the second write still fails. Either write on its own, with the other commented out, works fine. "It should also work"? I don't want it to also work, because it doesn't work at all. I want it to, you know, work! Is there a known issue with writing out multiple datasets derived from the same source?

Sorry the answer was confusing. I meant to refer back to the point in my earlier answer that I think your problem is `blocks = JOIN headers BY $0, footers BY $0;`. The proposed solution is to store the intermediate results; the other option is to load the raw data into two different relations.

My problem was very similar to yours, and both approaches seemed to fix it, on my Ubuntu machine and on AWS. I think the issue is that the join is never written out, and the join is only actually executed when its output is needed.