Python: how do I find the path flow and rank it using Pig or Hive?
Below is an example of my use case; you can refer to where the OP asked a similar question.

If I understand your question correctly, you want to remove duplicates from a path, but only when they are adjacent to each other. So 1->1->2->1 would become 1->2->1. If that is correct, then you cannot simply group and distinct (as I believe you have noticed), because that would remove all duplicates. A straightforward solution is to write a UDF that removes these adjacent duplicates while preserving the user's distinct path.
UDF:
package something;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        if (arr == null || arr.isEmpty()) {
            return newList;
        }
        // Always keep the first element; then keep each element
        // only if it differs from the one immediately before it.
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
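Since the question is tagged python, the same adjacent-deduplication can be sketched in a few lines of Python with `itertools.groupby`, which collapses runs of equal adjacent elements while preserving order:

```python
from itertools import groupby

def remove_sequential_duplicates(screens):
    # groupby() yields one key per run of equal adjacent elements,
    # so taking just the keys drops adjacent repeats but keeps order.
    return [key for key, _run in groupby(screens)]

print(remove_sequential_duplicates(["1", "1", "2", "1"]))  # ['1', '2', '1']
```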
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";
select screen_flow, count
     , dense_rank() over (order by count desc) rank
from (
    select screen_flow
         , count(*) count
    from (
        select session_id
             , concat_ws("->", remove_dups(screen_array)) screen_flow
        from (
            select session_id
                 , collect(screen_name) screen_array
            from (
                select *
                from database.table
                order by screen_launch_time ) a
            group by session_id ) b
        ) c
    group by screen_flow ) d
Output:
s1->s2->s3       2   1
s1->s2           1   2
s1->s2->s3->s1   1   2
Hope this helps.

Input:
990004916946605-1404157897784,S1,1404157898275
990004916946605-1404157897784,S1,1404157898286
990004916946605-1404157897784,S2,1404157898337
990004947764274-1435162269418,S1,1435162274044
990004947764274-1435162269418,S2,1435162274057
990004947764274-1435162269418,S3,1435162274081
990004947764274-1435162287965,S2,1435162690002
990004947764274-1435162287965,S1,1435162690001
990004947764274-1435162287965,S3,1435162690003
990004947764274-1435162287965,S1,1435162690004
990004947764274-1435162212345,S1,1435168768574
990004947764274-1435162212345,S2,1435168768585
990004947764274-1435162212345,S3,1435168768593
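Before the Pig version below, here is a minimal pure-Python sketch of the whole pipeline on this sample input (sort by launch time per session, collapse adjacent repeats, count each flow, dense-rank by count). The function names are mine, not from any of the scripts here:

```python
from collections import Counter, defaultdict
from itertools import groupby

SAMPLE = """\
990004916946605-1404157897784,S1,1404157898275
990004916946605-1404157897784,S1,1404157898286
990004916946605-1404157897784,S2,1404157898337
990004947764274-1435162269418,S1,1435162274044
990004947764274-1435162269418,S2,1435162274057
990004947764274-1435162269418,S3,1435162274081
990004947764274-1435162287965,S2,1435162690002
990004947764274-1435162287965,S1,1435162690001
990004947764274-1435162287965,S3,1435162690003
990004947764274-1435162287965,S1,1435162690004
990004947764274-1435162212345,S1,1435168768574
990004947764274-1435162212345,S2,1435168768585
990004947764274-1435162212345,S3,1435168768593"""

def screen_flows(lines):
    """session_id,screen_name,launch_time lines -> Counter of deduped flows."""
    sessions = defaultdict(list)
    for line in lines:
        session_id, screen, ts = line.split(",")
        sessions[session_id].append((int(ts), screen))
    flows = Counter()
    for events in sessions.values():
        events.sort()                         # order by launch_time
        screens = (s for _, s in events)
        # drop adjacent repeats, then join into one flow string
        flows["->".join(k for k, _ in groupby(screens))] += 1
    return flows

def dense_rank(flows):
    """Mimic dense_rank() over (order by count desc)."""
    ranked, rank, prev = {}, 0, None
    for flow, cnt in flows.most_common():
        if cnt != prev:
            rank, prev = rank + 1, cnt
        ranked[flow] = (cnt, rank)
    return ranked

for flow, (cnt, rank) in dense_rank(screen_flows(SAMPLE.splitlines())).items():
    print(flow, cnt, rank)
```

On this sample it produces the same three flows and ranks as the Hive output above (S1->S2->S3 twice at rank 1, the other two once at rank 2).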
register /home/cloudera/jar/ScreenFilter.jar;
screen_records = LOAD '/user/cloudera/inputfiles/screen.txt' USING PigStorage(',') AS(session_id:chararray,screen_name:chararray,launch_time:long);
screen_rec_order = ORDER screen_records by launch_time ASC;
session_grped = GROUP screen_rec_order BY session_id;
eached = FOREACH session_grped
{
ordered = ORDER screen_rec_order by launch_time;
GENERATE group as session_id, REPLACE(BagToString(ordered.screen_name),'_','-->') as screen_str;
};
screen_each = FOREACH eached GENERATE session_id, GetOrderedScreen(screen_str) as screen_pattern;
screen_grp = GROUP screen_each by screen_pattern;
screen_final_each = FOREACH screen_grp GENERATE group as screen_pattern, COUNT(screen_each) as pattern_cnt;
ranker = RANK screen_final_each BY pattern_cnt DESC DENSE;
output_data = FOREACH ranker GENERATE screen_pattern, pattern_cnt, $0 as rank_value;
dump output_data;
I could not find a way to remove adjacent screens for the same session id using Pig's built-in functions, so I used a Java UDF to remove the adjacent screen names.

I created a Java UDF named GetOrderedScreen, packaged it into a jar named ScreenFilter.jar, and registered that jar in the Pig script above.

Below is the code for the GetOrderedScreen Java UDF:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetOrderedScreen extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String incoming_screen_str = (String) input.get(0);
        String screen_array[] = incoming_screen_str.split("-->");
        // Start from the first screen, then append each following screen
        // only when it differs from the screen before it.
        String full_screen = screen_array[0];
        for (int i = 0; i < screen_array.length; i++) {
            String prefix_screen = screen_array[i];
            String suffix_screen = "";
            int j = i + 1;
            if (j < screen_array.length) {
                suffix_screen = screen_array[j];
            }
            if (!prefix_screen.equalsIgnoreCase(suffix_screen)) {
                full_screen = full_screen + "-->" + suffix_screen;
            }
        }
        // The last iteration compares the final screen against "" and so
        // always appends a trailing "-->"; trim it off before returning.
        return full_screen.substring(0, full_screen.lastIndexOf("-->"));
    }
}
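The same string-level logic as the GetOrderedScreen UDF can be sketched in Python (function name is mine), including the case-insensitive comparison the Java version does with equalsIgnoreCase:

```python
def get_ordered_screen(screen_str):
    """Drop adjacent repeats in a '-->'-joined screen string (case-insensitive)."""
    screens = screen_str.split("-->")
    kept = [screens[0]]
    for screen in screens[1:]:
        # keep a screen only when it differs from the previous kept one
        if screen.lower() != kept[-1].lower():
            kept.append(screen)
    return "-->".join(kept)

print(get_ordered_screen("S1-->S1-->S2-->S1"))  # S1-->S2-->S1
```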
Hope this helps! Give it some time, though; someone sharper who sees this question may answer it efficiently without a Java UDF.

Comments:
- OK, I got the requirement and tried for a while to get the actual output. It looks like we need a Java UDF to remove adjacent identical screen names; let me think about it some more. I wanted to give a solution without a UDF.
- @explorethis: Could you add the input and expected output in plain text? I'd like to try this use case.
- @MuraliRao: I can't keep the formatting in the same order here as plain text. If you share your email I'll send a Word document for it :)
- @SurenderRaja: Could you advise on the following scenario? Great!! I'll give it a try and let you know :) From what I can see I understand your approach, and I hope it solves my problem.
- Awesome!! It works well. Thanks! I need your expertise on the problem in my new post. Could you rewrite the above Hive logic in Pig?
- Awesome!! Let me try this. One small question: similar to rank, I also need a level. How do I find the level from the above output? For example, S1-->S2-->S3 is level 2, S1-->S2 is level 1, S1-->S2-->S3-->S1 is level 3.
- Basically, count the number of arrows (->).
- I wrote a Python script to count the substrings to get the result. Thanks for the help! I'm now trying to add two new columns, application_name and date, to the Pig script above but am running into trouble. Can you help?
- What are your application_name and date? Should they stay the same for all output records?
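For the follow-up question in the comments about a "level", the level is just the number of transitions, i.e. the number of "->" separators in the flow string. A one-line Python helper (my naming) covers it:

```python
def flow_level(screen_flow):
    # level = number of transitions = number of "->" separators
    return screen_flow.count("->")

print(flow_level("S1->S2->S3"))  # 2
```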