Python: Loading JSON data from Pig streaming
I have a Python script, stream.py, that reads JSON lines from stdin, processes each one, and writes JSON lines to stdout. The input lines come from data.json. Example output line:
{"user_id":3217,"description":"some text PROCESSED","rating":1.78}
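For context, a script of this shape might look like the minimal sketch below. The question does not show the real processing step, so the `process` function here is a hypothetical stand-in that simply appends " PROCESSED" to the description, which is consistent with the output line above; the function names `process` and `run` are illustrative as well.

```python
import json
import sys


def process(record):
    """Hypothetical transformation: mark the description as processed."""
    out = dict(record)
    out["description"] = f"{record['description']} PROCESSED"
    return out


def run(lines, out):
    """Read JSON lines from `lines` and write one compact JSON line per record to `out`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = process(json.loads(line))
        # Compact separators keep the output in the same form as the example line.
        out.write(json.dumps(record, separators=(",", ":")) + "\n")
```

To use this as a streaming script, the file would end with a call such as `run(sys.stdin, sys.stdout)`.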
In Pig, I tried to stream the data this way:
data = LOAD 'data.json';
DEFINE my_stream `./stream.py` output (stdout USING JsonLoader('user_id:int, description:chararray, rating:float'));
data_streamed = STREAM data THROUGH my_stream;
ratings = FOREACH data_streamed GENERATE rating;
ratings_unique = DISTINCT ratings;
ratings_test = LIMIT ratings_unique 10;
DUMP ratings_test;
When I try to run it, I get the following error:
pig script failed to validate: java.lang.ClassCastException: class org.apache.pig.builtin.JsonLoader does not implement interface org.apache.pig.StreamToPig
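So far I have only found two solutions, both of which I would like to avoid if possible:
Store the streamed data in a temporary file and load it with JsonLoader.
Modify stream.py to write TSV lines instead of JSON lines, so that I can load the output with the default PigStorage.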
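To make the second option concrete, here is a rough sketch of how the output side of stream.py could emit tab-separated lines. It is only an illustration: the `FIELDS` tuple, `to_tsv`, and `run` are hypothetical names, the processing step is left out, and replacing embedded tabs and newlines with spaces is an assumed policy rather than anything the question specifies.

```python
import json
import sys

# Assumed field order; it has to match the schema declared on the Pig side.
FIELDS = ("user_id", "description", "rating")


def to_tsv(record):
    """Render one record as a tab-separated line in FIELDS order."""
    cells = []
    for name in FIELDS:
        # Tabs or newlines inside a value would break the row layout,
        # so this sketch replaces them with spaces.
        cells.append(str(record[name]).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


def run(lines, out):
    """Read JSON lines from `lines` and write one TSV line per record to `out`."""
    for line in lines:
        line = line.strip()
        if line:
            out.write(to_tsv(json.loads(line)) + "\n")
```

With output in this form, the schema would presumably move to the STREAM statement, for example `STREAM data THROUGH my_stream AS (user_id:int, description:chararray, rating:float);`, since Pig's default streaming output is tab-delimited.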
Is it possible to make Pig streaming work with JsonLoader?