Counting hashtags with Pig Latin on Hadoop
My sample dataset looks like this:
tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']
SRiku0728, 福山市, ja, 6705, 357, 273, ['None']
BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']
AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'Bogota', 'jobs']
Accprimary, Manchester, en, 1650, 268, 404, ['None']
The data inside the square brackets are the hashtags, and I want to count the top 10 hashtags across the entire dataset.
I have gotten this far and am not sure how to proceed:
twitter_feed = LOAD '/twitter-data-mining/15' USING PigStorage(',');
hash_tags = FOREACH twitter_feed GENERATE $7;
fallten = FILTER hash_tags BY $1 MATCHES '\w+'|'\w+(\s\w+)*'
DUMP fallten;
Any pointers in the right direction would be much appreciated. Thanks.

The LOAD statement is incorrect. There are two ways to get the hashtags. The first is to load using '[' as the delimiter and then manipulate the string to count the hashtags. The second is to load the whole line and use REGEX_EXTRACT_ALL to pull out the hashtags. The first approach is shown below.
-- Load with '[' as the delimiter: $0 holds the fields before the bracket,
-- $1 holds everything from the first hashtag through the closing ']'
A = LOAD 'test10.txt' USING PigStorage('[');
-- Strip the trailing ']' and the single quotes around each hashtag
B = FOREACH A GENERATE REPLACE(REPLACE($1,']',''),'\'','');
-- TOKENIZE splits on commas and whitespace; FLATTEN yields one hashtag per row
C = FOREACH B GENERATE FLATTEN(TOKENIZE(*));
-- Drop the 'None' placeholder entries
D = FILTER C BY NOT($0 MATCHES 'None');
-- Group identical hashtags and count each group
E = GROUP D BY $0;
F = FOREACH E GENERATE group, COUNT(D.$0);
DUMP F;
Output:
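The counts in F are unordered, and the original question asks for the top 10 hashtags. Continuing from the relations above, a minimal sketch (the alias names G and top10 are my own):

G = ORDER F BY $1 DESC;   -- sort (hashtag, count) pairs by count, largest first
top10 = LIMIT G 10;       -- keep only the first 10 rows
DUMP top10;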
Comment: What does fallten contain? As mentioned, loading the data with ',' as the delimiter will not give the correct result.

Comment: fallten does not give the correct result. I was wondering whether I can extract only the hashtags inside the single quotes.
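To address the comment about extracting only the text inside the single quotes, the second approach mentioned in the answer can be sketched as follows. This is untested and makes two assumptions of mine: the data contains no tab characters (so a tab delimiter loads each line whole), and REGEX_EXTRACT is used rather than REGEX_EXTRACT_ALL, since the number of hashtags per line varies while REGEX_EXTRACT_ALL returns a fixed tuple of capture groups:

-- Load each line as a single chararray (tab delimiter, assuming no tabs in the data)
R = LOAD 'test10.txt' USING PigStorage('\t') AS (line:chararray);
-- Pull out everything between the square brackets (capture group 1)
S = FOREACH R GENERATE REGEX_EXTRACT(line, '\\[(.*)\\]', 1) AS tags;
-- Remove the quotes, then split on commas/whitespace, one hashtag per row
T = FOREACH S GENERATE FLATTEN(TOKENIZE(REPLACE(tags, '\'', '')));
U = FILTER T BY NOT($0 MATCHES 'None');
DUMP U;

From here, U can be grouped and counted exactly as in the first approach.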