Apache Pig: Efficient filtering by a loaded list

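In Apache Pig (version 0.16.x), what is the most efficient way to filter a dataset by an existing list of values for one of its fields?

For example (updated following @inquisitive_mind's tip):

Input: a line-delimited file with one value per line, my_codes.txt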

'110'
'100'
'000'

sample_data.txt

'110', 2
'110', 3
'001', 3
'000', 1

Expected output

'110', 2
'110', 3
'000', 1

Sample script

%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') 
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

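I also tried FILTER sample_data BY code IN my_codes and FILTER sample_data BY code IN (my_codes), but got the error:
A column needs to be projected from a relation for it to be used as a scalar

The my_codes.txt file had the codes in a single row rather than in a column. Since you are loading it into a single field, the file should contain one code per line, like this: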
'110'
'100'
'000'

Alternatively, you can use a JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;

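Also, FILTER is generally preferred over JOIN computationally, but in this case might they be equally expensive (for example, if my_codes.txt contains 1 billion records)?

Yes, FILTER is preferable. However, I doubt IN will work here, because in earlier versions of Pig the values had to be specified explicitly, separated by commas and enclosed in parentheses.

Correct. Even in later versions, IN does not seem practical if my_codes has more than 1000 values. So your JOIN solution seems to be the best option.

Follow-up: if the existing list is small compared with the dataset being queried, a replicated join in Pig is more efficient than a standard join.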
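
The replicated join mentioned in the follow-up could look roughly like the sketch below. It reuses the relation and field names from the question; the comma delimiter passed to PigStorage is an assumption based on the sample data shown, and the sketch only applies if my_codes fits in memory on each task.

```pig
-- Minimal sketch of a replicated (map-side) join, assuming comma-delimited input.
my_codes = LOAD 'my_codes.txt' USING PigStorage(',') AS (code:chararray);
sample_data = LOAD 'sample_data.txt' USING PigStorage(',') AS (code:chararray, point:float);

-- The large relation comes first; the small relation listed after it
-- is loaded into memory, so no reduce phase is needed for the join.
joined_data = JOIN sample_data BY code, my_codes BY code USING 'replicated';

-- Keep only the columns of sample_data, disambiguated with the :: prefix.
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If my_codes.txt may contain duplicate codes, applying DISTINCT to my_codes before the join avoids duplicated output rows.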