Apache Pig: filter a relation by comparing each row with the next row of the same relation

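I have spent a long time trying to solve my problem but have found almost nothing useful. I hope some of you can give me a hint.

I have a relation with the following format: username, timestamp, ip

For example: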

Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
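
I want to find out who changed their IP address within less than 5 seconds. So the second and third rows should be the interesting ones.

My idea is to group the relation by username and compare the timestamp of the current row with that of the next row. If the IP addresses differ and the timestamps are less than 5 seconds apart, the row should appear in the output.

Can anybody help me with this?

Regards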


First of all, thank you for taking the time.

But I am actually still stuck at the Sessionize part.

This is my input data:

aoebcu  2014-02-19T14:23:17.503Z    220.61.65.25
aoebcu  2014-02-19T14:23:14.503Z    222.117.144.19
aoebcu  2014-02-19T14:23:14.503Z    222.117.144.19
jekgru  2014-02-19T14:23:14.503Z    213.56.157.109
zmembx  2014-02-19T14:23:12.503Z    199.188.198.91
qhixcg  2014-02-19T14:23:11.503Z    203.40.104.119

My code so far looks like this:

hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';

hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
  views = ORDER hijack_Filtered BY timestamp;
  GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName,timestamp,OriginalClientIP,session_id);
}
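
But when I run this script, I get the following error message:

15:36:22 ERROR - org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542) | ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator Key: scope-199) children: null at []]: java.lang.IllegalArgumentException: Invalid format: "aoebcu"

I have tried a lot, but nothing worked. Do you have any idea?


Regards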

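Although you could write a UDF for this, you can actually solve the problem with UDFs that already exist in DataFu.

My solution involves applying sessionization to the data. Basically, you look at consecutive events and assign a session ID to each one. If the time elapsed between two events exceeds a specified amount (5 seconds in your case), the next event gets a new session ID. Otherwise, consecutive events get the same session ID. Once a session ID has been assigned to each event, the rest is easy: group by session ID and look for sessions that have more than one distinct IP address.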
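
To make the sessionization rule concrete, here is a minimal plain-Java sketch of the idea. It is not DataFu's implementation: the class name `SessionizeSketch`, the method `sessionIndexes`, and the use of integer session indexes instead of real session IDs are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the sessionization rule, not DataFu code.
class SessionizeSketch {

    /**
     * Assigns a session index to each event time of a single user.
     * The times must already be sorted in ascending order.
     * A new session starts when the gap to the previous event exceeds the timeout.
     */
    static List<Integer> sessionIndexes(List<Instant> sortedTimes, Duration timeout) {
        List<Integer> sessions = new ArrayList<>();
        int current = 0;
        Instant previous = null;
        for (Instant t : sortedTimes) {
            if (previous != null && Duration.between(previous, t).compareTo(timeout) > 0) {
                current++; // gap larger than the timeout: start a new session
            }
            sessions.add(current);
            previous = t;
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<Instant> times = Arrays.asList(
                Instant.parse("2014-02-18T16:14:49.503Z"),
                Instant.parse("2014-02-18T16:14:51.503Z"),
                Instant.parse("2014-02-18T16:14:55.503Z"));
        // Gaps of 2 s and 4 s stay within one 5 s session.
        System.out.println(sessionIndexes(times, Duration.ofSeconds(5))); // [0, 0, 0]
    }
}
```

Note that the rule compares each event only with the one directly before it, so a chain of closely spaced events can form a session that spans more than 5 seconds in total.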

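I will walk through my solution.

Suppose you have the input data listed in the unit test at the end of this answer: three events each for Harold and Kumar. Both of them changed their IP address, but Harold did so within 5 seconds and Kumar did not. So the output of our script should be just "Harold".

Load the data: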

data = LOAD 'input' using PigStorage(',') 
       AS (user:chararray,time:chararray,ip:chararray);
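
Now define a couple of UDFs from DataFu. The Sessionize UDF performs the sessionization I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.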

define Sessionize datafu.pig.sessions.Sessionize('5s');

define DistinctBy datafu.pig.bags.DistinctBy('1');
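
Group the data by user, order it by time, and apply the Sessionize UDF. Note that the timestamp must be the first field, because that is what Sessionize expects. This UDF appends a session ID to each tuple.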

data = FOREACH data GENERATE time,user,ip;

data_sessionized = FOREACH (GROUP data BY user) {
  views = ORDER data BY time;
  GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
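
Now that the data is sessionized, we can group by user and session. I group by user as well because I want to output that value. We pass the bag of events through the DistinctBy UDF; see that UDF's documentation for a more detailed explanation. In essence, we get back as many tuples as there are distinct IP addresses in each session. Note that I have dropped time from the relation below. This is because 1) it is not needed, and 2) DistinctBy in DataFu 1.2.0 has a bug when handling fields that contain dashes, as the time field does.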

data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;

data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
  group.user as user,
  SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
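
As a rough illustration of what "distinct by field 1" means for one session's bag, here is a small plain-Java sketch. It only mimics the behaviour described above and is not the DataFu implementation; the class `DistinctBySketch`, the method `distinctBy`, and the session ID `"s1"` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a distinct-by-field operation, not DataFu code.
class DistinctBySketch {

    /** Keeps the first tuple seen for each distinct value at position fieldIndex. */
    static List<String[]> distinctBy(List<String[]> bag, int fieldIndex) {
        Set<String> seen = new HashSet<>();
        List<String[]> out = new ArrayList<>();
        for (String[] tuple : bag) {
            if (seen.add(tuple[fieldIndex])) { // add() returns false for repeated values
                out.add(tuple);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // (user, ip, session_id) tuples of one session; "s1" is a made-up session ID.
        List<String[]> session = Arrays.asList(
                new String[] {"Harold", "123.123.123.123", "s1"},
                new String[] {"Harold", "123.123.123.123", "s1"},
                new String[] {"Harold", "321.321.321.321", "s1"});
        // Field 1 is the ip, so the size is the number of distinct IPs in the session.
        System.out.println(distinctBy(session, 1).size()); // 2
    }
}
```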

Now select all sessions that have more than one distinct IP address, and return the distinct users for those sessions.

data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;

data_sessionized = FOREACH data_sessionized GENERATE user;

data_sessionized = DISTINCT data_sessionized;

This produces just:

Harold
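
If it helps to check the logic outside of Pig, the whole pipeline can be approximated in plain Java. This is only a sketch under assumptions: the class `IpChangeSketch`, its `Event` helper, and the method `usersWithIpChange` are invented for illustration, and the session rule is the "gap exceeds the timeout" rule described in the answer rather than DataFu's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical plain-Java approximation of the Pig script, not DataFu code.
class IpChangeSketch {

    /** One event of a user: time and ip. */
    static final class Event {
        final Instant time;
        final String ip;

        Event(Instant time, String ip) {
            this.time = time;
            this.ip = ip;
        }
    }

    /** Returns users having at least one session with more than one distinct IP. */
    static Set<String> usersWithIpChange(List<String> csvLines, Duration timeout) {
        // Equivalent of GROUP data BY user; each line is "user,time,ip".
        Map<String, List<Event>> byUser = new TreeMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            byUser.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(new Event(Instant.parse(f[1]), f[2]));
        }

        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<Event>> entry : byUser.entrySet()) {
            List<Event> events = entry.getValue();
            events.sort(Comparator.comparing((Event e) -> e.time)); // ORDER data BY time

            Set<String> ipsInSession = new HashSet<>();
            Instant previous = null;
            for (Event e : events) {
                boolean newSession = previous != null
                        && Duration.between(previous, e.time).compareTo(timeout) > 0;
                if (newSession) {
                    ipsInSession.clear(); // distinct IPs are counted per session
                }
                ipsInSession.add(e.ip);
                if (ipsInSession.size() > 1) {
                    result.add(entry.getKey()); // distinctIpCount > 1
                }
                previous = e.time;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
                "Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
                "Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
                "Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
                "Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
                "Kumar,2014-02-18T16:15:05.503Z,321.321.321.321");
        System.out.println(usersWithIpChange(input, Duration.ofSeconds(5))); // [Harold]
    }
}
```

Kumar's events are 6 and 10 seconds apart, so each one lands in its own session and no session ever holds two IP addresses.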

Here is the full source code, which you should be able to paste directly into a DataFu unit test and run:

  /**
  define Sessionize datafu.pig.sessions.Sessionize('5s');

  define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip

  data = LOAD 'input' using PigStorage(',') AS (user:chararray,time:chararray,ip:chararray);

  data = FOREACH data GENERATE time,user,ip;

  data_sessionized = FOREACH (GROUP data BY user) {
    views = ORDER data BY time;
    GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
  }

  data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;

  data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
    group.user as user,
    SIZE(DistinctBy(data_sessionized)) as distinctIpCount;

  data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;

  data_sessionized = FOREACH data_sessionized GENERATE user;

  data_sessionized = DISTINCT data_sessionized;

  STORE data_sessionized INTO 'output';
   */
  @Multiline private String sessionizeUserIpTest;

  private String[] sessionizeUserIpTestData = new String[] {
      "Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
      "Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
      "Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
      "Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
      "Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
      "Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
  };

  @Test
  public void sessionizeUserIpTest() throws Exception
  {
    PigTest test = createPigTestFromString(sessionizeUserIpTest);

    this.writeLinesToFile("input", 
        sessionizeUserIpTestData);

    List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");

    assertEquals(result.size(),1);
    assertEquals(result.get(0).get(0),"Harold");
  }
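
Comments:

You would need to write a UDF. You could look into using Esper.

Hi matterhayes, could you take a look at my Sessionize error message? Thanks a lot.

Sure. You need to make timestamp the first field. I will update this answer with that note. Take a look at the FOREACH I described in my other comment.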
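
The error in the question fits this explanation: with the fields ordered (ClientUserName, timestamp, OriginalClientIP), the first field of each tuple is the username, and "aoebcu" cannot be parsed as a timestamp. The sketch below reproduces that situation with `java.time`, which is an assumption made for the example and not necessarily the parser Sessionize uses internally; the class `FirstFieldSketch` and the method `isIsoInstant` are hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

// Hypothetical illustration of why the field order matters, not DataFu code.
class FirstFieldSketch {

    /** Returns true if the value parses as an ISO-8601 instant. */
    static boolean isIsoInstant(String value) {
        try {
            Instant.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Field order from the question: (ClientUserName, timestamp, OriginalClientIP)
        String[] asInQuestion = {"aoebcu", "2014-02-19T14:23:17.503Z", "220.61.65.25"};
        // Field order with the timestamp first, as the answer requires
        String[] reordered = {"2014-02-19T14:23:17.503Z", "aoebcu", "220.61.65.25"};

        System.out.println(isIsoInstant(asInQuestion[0])); // false
        System.out.println(isIsoInstant(reordered[0]));    // true
    }
}
```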