Regex: Searching for hashtags using Apache Pig

I am trying to determine the top 10 hashtags in a text file containing tweets in the following format:

USER_79321756   2010-03-05T04:48:05 ÜT: 47.528139,-122.197916   47.528139   -122.197916 Just talkin too for real. Ha.
USER_79321756   2010-03-05T20:25:56 ÜT: 47.528139,-122.197916   47.528139   -122.197916 RT @USER_620cd4b9: @USER_79321756 hey now! Leave me, and my big eyes alone LOL>>lol NO! :*
USER_4659ef22   2010-03-06T05:50:54 ÜT: 40.816206,-73.894429    40.816206   -73.894429  But where's @USER_55e0f4ff?? Hmmm shawty where u at?
USER_064b120e   2010-03-03T18:56:49 ÜT: 34.223957,-118.600448   34.223957   -118.600448 @USER_4a4d09c2 the ludacris one . have you heard it , he got off on that one .
I came up with the following code snippet to do this.

Code:

a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
c = filter b by STARTSWITH(tokens,'#');
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 10;
dump g; 
The results are shown below.

Result:

  (#ff, 55)
  (#inhighschool, 25)
  ...
  ...
  ...
  ...
  ...
  ...
  (#random, 9)
  (#mewithoutyouislike, 7)
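I have also included an image of the output.

However, if I open the text file containing the tweets (full_text_small.txt) in a word editor and search for the hashtag "#ff" (case-insensitive), the total count I get is 61, not 55. Likewise, the counts for all the other hashtags in the output differ from the counts obtained with Pig.

In addition, when I use a different matching technique, shown below, I get slightly different results.

Code: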

a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
c = filter b by tokens MATCHES '#\\s*(\\w+)';
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 10;
dump g;
Result:

  (#ff, 55)
  (#inhighschool, 25)
  ...
  ...
  ...
  ...
  ...
  ...
  (#random, 9)
  (#realgrandmas, 7)
Output image of the second snippet:

All hashtags in the output of the two snippets are the same except for the last one.

My questions are as follows:

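  • Why do the two code snippets give different results for the last hashtag?
  • Why do the results from these snippets not match the results I get from the search function in a text editor?

Here are my theories: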

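  • The change in the last hashtag has nothing to do with the two snippets you mention. Both hashtags have the same count, so it is undetermined which of them takes precedence during the ORDER and the subsequent LIMIT.
  • Because you use TOKENIZE followed by STARTSWITH, you require each hashtag to be preceded by whitespace. When you search in a text editor, the search probably also counts "#ff" hashtags that have no whitespace before them.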
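
Both points can be sketched as a variant of the first snippet. This is an untested sketch under assumptions, not a confirmed fix: it assumes that inserting a space before every '#' with REPLACE is acceptable for this data set, and the alphabetical tiebreak on tokens is an arbitrary choice made only to keep the output stable between runs.

```pig
-- Same input and schema as in the question (tab-delimited by default).
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);

-- Assumption: put a space before every '#' so that a hashtag glued to the
-- preceding text (e.g. "lol#ff") becomes its own token after TOKENIZE.
b = foreach a generate FLATTEN(TOKENIZE(REPLACE(LOWER(tweet), '#', ' #'))) as tokens;

c = filter b by STARTSWITH(tokens, '#');
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;

-- Secondary sort key: hashtags with equal counts are ordered alphabetically,
-- so the rows kept by LIMIT no longer depend on execution order.
f = order e by cnt desc, tokens asc;
g = limit f 10;
dump g;
```

Even with this change the counts may not match an editor search exactly. TOKENIZE leaves trailing punctuation such as "." or "!" attached to a token, so "#ff!" is counted separately from "#ff", while a plain substring search for "#ff" in an editor also matches longer hashtags that merely begin with those characters.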