How do I get a SQL-like GROUP BY using Apache Pig?

I have the following input, named movieUserTagFltr:

(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
It follows the schema:
(movieId:int,tags:bag{(movieId:int,tag:chararray),…})

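Basically, the first number is a movie id, and the bag that follows holds every keyword associated with that movie. I want to group those keywords so that I get output like this: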

(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
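Note that for the movie with id 260, the tag Science Fiction appears 4 times. Using GROUP BY and COUNT, I managed to count the distinct keywords for each movie with the following script: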

sum = FOREACH group_data { 
    unique_tags = DISTINCT movieUserTagFltr.tags::tag;
    GENERATE group, COUNT(unique_tags) as tag;
};
But this only returns a global count, and I need a local count per tag. So the logic I have in mind is:

result = iterate over each tuple of group_data {
    generate a tuple with $0, and a bag with {
        foreach distinct tag that group_data has in its $1 field do {
            generate a tuple like: (tag_name, count of how many times that tag appeared on $1)
        }
    }
}

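You can flatten the original input so that each movieID/tag pair becomes its own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.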

Assuming you start from movieUserTagFltr with the schema you described:

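-- One record per (movieID, tag) pair
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
-- Group identical pairs and count how often each one occurs
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
    FLATTEN(group) AS (movieID, tag),
    COUNT(A) AS movie_tag_count;
-- Collect the per-tag counts back into one bag per movie
D = GROUP C BY movieID;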
Your final schema is:

D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
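
The schema above still carries movieID inside every tuple of the bag, whereas the output asked for in the question holds only (count, tag) pairs. One possible last step, sketched here on the assumption that the aliases C and D are defined as in the script above, is to project just those two fields out of the bag. The names E and tag_counts are introduced only for illustration.

```pig
-- Keep the movie id once and project only (count, tag) from each grouped tuple.
-- E and tag_counts are illustrative names, not part of the original script.
E = FOREACH D GENERATE
    group AS movieID,
    C.(movie_tag_count, tag) AS tag_counts;
DUMP E;
```

For movie 924 this should give a bag containing (1,slow) and (1,boring). Pig does not guarantee the order of tuples inside a bag, so the tags may appear in a different order than in the input.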
