Apache Pig: count how many words of each distinct length there are in the data, e.g. (8, 1) (words, length)

The function should output pairs, in a format like the example or something similar.

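To get the length of a string "theWord" in Pig, you apply the function SIZE to each word. To concatenate the size of a word with the string "Length", you apply the function CONCAT to each size. Finally, I know that to convert an integer to a string, so that it can be concatenated with another string, you cast it to (CHARARRAY). For example, I would use "(CHARARRAY)SIZE(word)".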

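I have written the code below, but when I try to dump the data it does not do what I expected. I think I may need to use the COUNT function, but I am a bit confused about how.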

p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_unique = group words_lower by word_lower;
words_with_size = foreach words_unique generate SIZE(words_lower) as size, group;
words_with_size_concat = CONCAT words_with_count BY (CHARARRAY)size(words_lower) DESC, group;

I figured it out! The code should look like this:

p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_length = foreach words generate CONCAT('Length ', (CHARARRAY)SIZE(word)) as word_length;
words_unique = group words_length by word_length;
words_with_count = foreach words_unique generate COUNT(words_length) as cnt, group;
words_with_count_sorted = ORDER words_with_count BY cnt DESC, group;
store words_with_count_sorted into 'poems/output/wordcount1';

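As I played around with it, I found that I did not need words_unique, but I did need to add words_length = foreach words generate CONCAT('Length ', (CHARARRAY)SIZE(word)) as word_length;

You should use a single LOAD statement with a wildcard in the file name, such as p = LOAD 'poems/input/Poem*.txt' USING TextLoader AS (line:CHARARRAY); no UNION is needed.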
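
Putting the wildcard suggestion together with the accepted code, the whole script might look like the sketch below. It assumes the six poem files are the only files matching Poem*.txt in poems/input and that the output directory does not exist yet. The relation names by_length, length_counts and length_counts_sorted are made up for this illustration, and the unused words_lower step is left out because lower-casing does not change a word's length.

```pig
-- Load every poem with one glob pattern instead of six LOADs plus a UNION.
-- Assumption: only the poem files match Poem*.txt in this directory.
p = LOAD 'poems/input/Poem*.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words, using the same delimiter set as the answer.
words = FOREACH p GENERATE FLATTEN(TOKENIZE(line, ' ,;:!?\t\n\r\f\\.\\-')) AS word;

-- Label each word with its length, e.g. 'Length 5'.
-- SIZE returns a long, so it is cast to chararray before CONCAT.
words_length = FOREACH words GENERATE CONCAT('Length ', (chararray)SIZE(word)) AS word_length;

-- Group by the label and count how many words fall into each group.
by_length = GROUP words_length BY word_length;
length_counts = FOREACH by_length GENERATE COUNT(words_length) AS cnt, group;

-- Most frequent lengths first; ties are broken by the label.
length_counts_sorted = ORDER length_counts BY cnt DESC, group;
STORE length_counts_sorted INTO 'poems/output/wordcount1';
```

Like the accepted code, this counts every occurrence of a word. If each distinct word should be counted only once, one option would be to lower-case the words and apply DISTINCT before the length step.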