Apache Pig / Hadoop Pig: replacing strings with their corresponding values from a map

I have a relation called conversations_grouped whose rows are bags of varying size, like this:

DUMP conversations_grouped;
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})
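Each L[0-9]+ is a tag that corresponds to a string. For example, L194 might be "Hello, how are you?" and L195 might be "Fine, how are you?". This correspondence is kept in a map called line_map. Here is a sample: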

DUMP line_map;
...
([L666324#Do you think she might be interested in  someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])

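What I want to do now is go through each row and replace every L[0-9]+ tag with the corresponding text from line_map. Is it possible to reference line_map from inside a Pig FOREACH statement, or do I need to do something else?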

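The first problem is that the key used to look up a value in a map must be a quoted string. You therefore cannot use the value of another field to access the map. For example, this will not work: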

C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
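For contrast, here is a minimal sketch of a lookup that Pig does accept, using the same relation C as above. The key 'L194' is just an illustrative constant taken from the question's data. The point is that the key has to be known when the script is written, which is why a per-row lookup driven by another field is not possible this way.

```pig
-- Works: the key is a quoted string constant.
D = FOREACH C GENERATE M#'L194';
```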
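The workaround I came up with is to FLATTEN conversations_grouped and then JOIN it with line_map on the L[0-9]+ tag. You will probably want to project away the fields you no longer need (such as the L[0-9]+ tag after the join) to speed up the next step. After that, you have to regroup the data and put it back into the right format.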

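This only works if each bag has its own unique ID to regroup on. However, if each L[0-9]+ tag appears in only one bag (conversation), you can use a tag to create that unique ID: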

-- A is conversations_grouped

B = FOREACH A {
    -- Pull one element out of the bag to use as the id
    id = LIMIT tags 1;
    -- Flatten A into (id, tag) form. Every tag from the same bag gets the same id.
    GENERATE FLATTEN(id), FLATTEN(tags);
}
The schema and output of B are:

B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)
Assuming the tags are unique, the remaining steps are:

-- A2 is line_map, loaded in tag/message pairs instead of a map

-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
    -- This generate removes the tag
    GENERATE id, message ;

-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id) 
    -- This step limits the output to just messages
    GENERATE C.(message) AS messages ;
The schema and output of D:

D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
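The join above assumes line_map is available as plain (tag, message) rows (the relation A2) rather than as a map. How you get there depends on how the data is stored. As a sketch only, assuming a tab-separated file with one tag and one message per line (the file name line_map.tsv is hypothetical), A2 could be loaded like this:

```pig
-- Hypothetical input file: each line is "tag<TAB>message"
A2 = LOAD 'line_map.tsv' USING PigStorage('\t')
     AS (tag: chararray, message: chararray);
```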
Note: in the worst case (the L[0-9]+ tags are not unique), you can assign a consecutive integer id to each line of the input file before loading it into Pig.


Update: if you are using Pig 0.11, you can also use the RANK operator.
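A rough sketch of how RANK might replace the LIMIT trick, so the approach no longer depends on tags being unique. It assumes A has the schema {tags: {(tag: chararray)}} and that A2 holds line_map as (tag, message) rows; the alias R is introduced here only for illustration.

```pig
-- RANK with no BY clause prepends a sequential number to every row of A.
-- That number serves as the conversation id.
R = RANK A;

-- Flatten to (id, tag) rows; $0 is the rank column added by RANK.
B = FOREACH R GENERATE $0 AS id, FLATTEN(tags) AS tag;

-- Join on the tag and keep only the id and the message text.
C = FOREACH (JOIN B BY tag, A2 BY tag)
    GENERATE B::id AS id, A2::message AS message;

-- Regroup on the id to rebuild one bag of messages per conversation.
D = FOREACH (GROUP C BY id) GENERATE C.message AS messages;
```

As with the original version, Pig bags are unordered, so the messages inside each regrouped bag are not guaranteed to come back in their original order.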

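Comments:

Can you add a sample of line_map?

@mr2ert I have updated the question.

Can you post the results of DESCRIBE line_map and DESCRIBE conversations, i.e. the schemas of those two aliases?

I think the idea of breaking the bags apart and regrouping them is a good one. Is there a way to add an index to each bag? Maybe create a relation containing 1, 2, ... conversations_grouped.length() and join it to conversations_grouped?

Not that I can think of. However, are the L[0-9]+ tags unique to a conversation? That is, if L194 is in one row of conversations_grouped, can it show up in another row?

Great! Then be sure to read my answer again. I have updated it with a solution that relies on the tags being unique.

Thanks for your answer! Instead of relying on the tags being unique, I used your idea of attaching a conversation ID to each bag. The RANK statement was very useful.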