Apache Pig: how do I get the number of words on each line in Pig?
I'm trying to work out how many words there are on each line of a file in Pig. I've got the loading and splitting done:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
This gives me a bag of tuples, each holding a single word. When I then try to count those items, I get an error:
counts = FOREACH words GENERATE COUNT(*);
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
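Is it because some lines produce empty bags? Or am I doing something else wrong?

If empty bags are the problem, you can try the following approach (untested): write an if-else condition that checks whether the tokenized words are null or empty. If they are, assign zero; otherwise, use the total count.

Can you try it like this?

Input: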
Hi hello how are you
this is apache pig
works
like a charm
Pig script:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)
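You should not use COUNT(*) like this; it is restricted in Pig. Pass the bag to count explicitly, as COUNT($0) does in the script above.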
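The null check described in the answer (zero when the tokenized bag is missing, the count otherwise) is not spelled out in the script above, so here is a minimal, untested sketch of how it might look. It assumes the same 'input' file, and the aliases tokens and word_count are names introduced here for illustration only. It also assumes that the empty () row in the output comes from TOKENIZE returning null for an empty line, which is consistent with the output shown but not confirmed in the thread.

```pig
-- Assumption: 'input' is the same plain-text file used in the script above.
A = LOAD 'input' AS (line:chararray);

-- Name the bag so it can be referenced explicitly instead of via *.
B = FOREACH A GENERATE TOKENIZE(line) AS tokens;

-- Bincond (cond ? a : b): emit 0 when the bag is null, otherwise count it.
-- 0L is a long literal, so both branches have the same type as COUNT's result.
C = FOREACH B GENERATE (tokens IS NULL ? 0L : COUNT(tokens)) AS word_count;

DUMP C;
```

With this variant, a line that previously produced () should produce a zero count instead, provided the null-bag assumption holds.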