Hadoop Streaming mangles Avro produced by Python
I have a fairly simple script that takes Twitter data in JSON format and converts it to an Avro file:
from avro import schema, datafile, io
import json, sys

def main():
    if len(sys.argv) < 2:
        print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
        return

    # Load the Avro schema and open a deflate-compressed container file
    s = schema.parse(open("tweet.avsc").read())
    f = open(sys.argv[1], 'wb')
    writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec='deflate')

    failed = 0
    for line in sys.stdin:
        line = line.strip()
        try:
            data = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON
            continue
        try:
            writer.append(data)
        except io.AvroTypeException:
            # Record did not match the schema
            print line
            failed += 1

    writer.close()
    print str(failed) + " failed in schema"

if __name__ == '__main__':
    main()
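The script reads its schema from tweet.avsc, which the question does not show. As a rough sketch of what such a schema file might look like (the record name and fields here are assumptions, not the asker's actual schema):

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "text", "type": "string"},
    {"name": "user", "type": ["null", "string"], "default": null}
  ]
}
```

A record whose fields are missing from this schema, or whose values have the wrong type, is what makes writer.append raise io.AvroTypeException in the loop above.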
I'm struggling with this problem. Any suggestions would be appreciated.

Not sure if you are still looking for an answer. The \u appears to be a Unicode escape sequence. Try something like this:
resp = json.dumps(line)
data = json.loads(resp)
If a Unicode escape is what caused the error, dumps will take care of it.
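To illustrate the point about Unicode escapes: json.loads already decodes \uXXXX sequences into real characters, so the decoded dict holds proper Unicode strings. A minimal self-contained check, using a made-up tweet line (the field names are hypothetical):

```python
import json

# Hypothetical one-line tweet as it might arrive on stdin: \u00e9 is
# a JSON Unicode escape inside the raw text, not a literal character.
line = '{"text": "caf\\u00e9", "id": 1}'

# json.loads decodes the \u00e9 escape into the actual accented
# character, so no raw backslash-u sequence survives in the value.
data = json.loads(line)
print(data["text"])
```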