Apache Pig: word count with Pig


I have processed data in the following form:

( id ,{ bag of words})
For example:

(foobar, {(foo), (foo),(foobar),(bar)})
(foo,{(bar),(bar)})
and so on. DESCRIBE gives me:

processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
What I want now is to count how many times each word appears in each id's bag, and output the result as:

foobar, foo, 2
foobar,foobar,1
foobar,bar,1
foo,bar,2

and so on...

How can I do this in Pig?

Although this can be done in pure Pig, a UDF should be more efficient. Something along these lines:

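@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(bag):
    # Each element of a Pig bag arrives in Jython as a tuple, e.g. ('foo',),
    # so unwrap the single field before counting.
    counts = {}
    for tup in bag:
        word = tup[0]
        counts[word] = counts.get(word, 0) + 1
    # A list of (word, count) tuples, matching the declared output schema.
    return counts.items()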
You can then use this UDF like so:

REGISTER 'myudfs.py' USING jython AS myudfs ;

-- A: (id, words: {T:(word:chararray)})

B = FOREACH A GENERATE id, FLATTEN(myudfs.generate_wordcount(words)) ;
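
To sanity-check the counting logic outside Pig, here is a minimal plain-Python sketch. It assumes each bag can be modelled as a list of one-element tuples; `count_words` and the sample `rows` are hypothetical names built from the example data in the question, not part of the Pig API.

```python
from collections import Counter

def count_words(bag):
    """Count words in a bag modelled as a list of one-element tuples."""
    return sorted(Counter(t[0] for t in bag).items())

# Sample data from the question: (id, bag of words).
rows = [
    ("foobar", [("foo",), ("foo",), ("foobar",), ("bar",)]),
    ("foo", [("bar",), ("bar",)]),
]

# Mimics FOREACH ... GENERATE id, FLATTEN(...): one row per (id, word, count).
flattened = [(rid, word, n) for rid, bag in rows for word, n in count_words(bag)]
for row in flattened:
    print(row)
```

Each printed row corresponds to one line of the output format requested in the question.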
Try this:

$ cat input 
foobar  foo
foobar  foo
foobar  foobar
foobar  bar
foo bar
foo bar

--preparing
inputs = LOAD 'input' AS (first: chararray, second: chararray);
grouped = GROUP inputs BY first;
formatted = FOREACH grouped GENERATE group, inputs.second AS second;
--what you need
flattened = FOREACH formatted GENERATE group, FLATTEN(second);
result = FOREACH (GROUP flattened BY (group, second)) GENERATE FLATTEN(group), COUNT(flattened);
DUMP result;
Output:

(foo,bar,2)
(foobar,bar,1)
(foobar,foo,2)
(foobar,foobar,1)
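
The pure-Pig approach boils down to counting identical (id, word) pairs once the bags are flattened. As a rough cross-check, the same grouping can be sketched in plain Python; the `pairs` list below is a hand-written stand-in for the flattened relation, not something Pig produces in this form.

```python
from collections import Counter

# Stand-in for the flattened relation: one (id, word) pair per input line.
pairs = [
    ("foobar", "foo"), ("foobar", "foo"), ("foobar", "foobar"),
    ("foobar", "bar"), ("foo", "bar"), ("foo", "bar"),
]

# Counting identical pairs plays the role of GROUP ... BY (group, second) plus COUNT.
result = sorted((first, second, n) for (first, second), n in Counter(pairs).items())
for row in result:
    print(row)
```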