Hadoop Apache Pig - how to get the number of matching elements between multiple bags?

I am a new Apache Pig user and I have a problem to solve.

I want to build a small search engine with Apache Pig. The idea is simple: I have a file that is the concatenation of several documents (one document per line). Here is an example with three documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3
3,word1 word3 word4 word5
Then I use the following lines of code to create a bag of words for each document:

docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE line;
C = FOREACH B GENERATE TOKENIZE(line) as gu;
Then I remove the duplicate entries from each bag:

filtered = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE uniq;
}
Here is the result of this code:

DUMP filtered;

({(word1), (word4),  (word2)})
({(word2), (word6),  (word1), (word5), (word3)})
({(word1), (word3),  (word4), (word5)})
So, for each document, I have the bag of words I wanted.

Now, let's consider the user query, given as a file:

word2 word7 word5
I convert the query into a bag of words:

query = LOAD '$query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer;

DUMP bag_query;
Here is the result:

({(word2), (word7), (word5)})
Now, my problem is the following: I want to get the number of matches between the query and each document. In this example, I would like to get this output:

1
2
1
I tried to join the two bags, but without success.

Could you help me?


Thank you.

Try using SetIntersect (a DataFu UDF) together with SIZE to get the number of elements in the resulting bag.

If you would rather do without any UDF, it can be done by pivoting the bags and doing it all SQL-style:

docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray);
C = FOREACH docs GENERATE id, TOKENIZE(line) as gu;
-- pivot: one (id, word) row per distinct word of each document
pivoted = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE id, FLATTEN(uniq) as word;
};
-- keep only the words that appear in the query
filtered = FILTER pivoted BY word MATCHES '(word2|word7|word5)';
--dump filtered;
-- number of query words matched per document
count_id_matched = FOREACH (GROUP filtered BY id) GENERATE group as id, COUNT(filtered) as count;

dump count_id_matched;

-- for each matched query word, the number of documents it appears in
count_word_matched_in_docs = FOREACH (GROUP filtered BY word) GENERATE group as word, COUNT(filtered) as count;

dump count_word_matched_in_docs;
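
As a rough sketch building on the code above (not part of the original answer), the hard-coded MATCHES regex could be avoided by loading the query file as well, flattening it into one word per row, and joining on the word; the '/input/query.dat' path and the new aliases below are assumptions:

query = LOAD '/input/query.dat' AS (line_query:chararray);  -- assumed path
query_words = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS word;
query_words = DISTINCT query_words;  -- guard against repeated words in the query

-- keep only the document words that also occur in the query
joined = JOIN pivoted BY word, query_words BY word;

-- per-document match count, analogous to count_id_matched above
count_id_matched_join = FOREACH (GROUP joined BY pivoted::id) GENERATE group AS id, COUNT(joined) AS count;

dump count_id_matched_join;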

As SNeumann pointed out, you can use DataFu's SetIntersect, as in the example below.

Building on your example, given these documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3 word7
3,word1 word3 word4 word5
And given this query:

word2 word7 word5
Then this code will give you what you want:

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
}

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
}

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
}

DUMP result;
The resulting values:

(1,1)
(2,3)
(3,1)
Here is a complete working example that you can paste into the DataFu unit tests for SetIntersect:


If you have any other use cases like this, I'd love to hear about them :) We're always looking for more useful UDFs to add to DataFu.

Thank you for your reply, but it doesn't work. In fact, my bags are in different relations, and it seems that SetIntersect requires the bags to be in the same relation.
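
One possible workaround (a sketch, not from the original thread) is to CROSS the two relations first, so that each document bag and the query bag end up in the same tuple before SetIntersect is applied; this reuses the filtered and bag_query aliases from the answer above:

crossed = CROSS filtered, bag_query;  -- puts the query bag next to every document bag

result = FOREACH crossed {
  -- re-sort both bags here, since SetIntersect expects sorted input
  tokens_sorted = ORDER uniq BY token;
  query_resorted = ORDER query_sorted BY token;
  GENERATE id, SIZE(SetIntersect(tokens_sorted, query_resorted)) as cnt;
}

DUMP result;

Since the query relation holds a single row, the CROSS only replicates that one bag across the documents, so it stays cheap.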