Hadoop Streaming mangles Avro produced by Python
I have a fairly simple script that takes Twitter data in JSON format and converts it to an Avro file:
from avro import schema, datafile, io
import json, sys

def main():
    if len(sys.argv) < 2:
        print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
        return

    # Load the Avro schema and open a deflate-compressed container file
    s = schema.parse(open("tweet.avsc").read())
    f = open(sys.argv[1], 'wb')
    writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec='deflate')

    failed = 0
    for line in sys.stdin:
        line = line.strip()
        try:
            data = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON
            continue
        try:
            writer.append(data)
        except io.AvroTypeException:
            # Record did not match the schema
            print line
            failed += 1

    writer.close()
    print str(failed) + " failed in schema"

if __name__ == '__main__':
    main()
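The script reads its schema from tweet.avsc, which the question does not show. As a rough sketch of what such a schema file might look like (the record name and fields here are assumptions, not the asker's actual schema):

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "text", "type": "string"},
    {"name": "user", "type": ["null", "string"], "default": null}
  ]
}
```

A record whose fields are missing from this schema, or whose values have the wrong type, is what makes writer.append raise io.AvroTypeException in the loop above.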
I'm struggling with this problem. Any suggestions would be appreciated.

Not sure if you are still looking for an answer. The \u appears to be a Unicode escape sequence. Try something like this:
resp = json.dumps(line)
data = json.loads(resp)
If a Unicode escape is what caused the error, dumps will take care of it.
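To illustrate the point about Unicode escapes: json.loads already decodes \uXXXX sequences into real characters, so the decoded dict holds proper Unicode strings. A minimal self-contained check, using a made-up tweet line (the field names are hypothetical):

```python
import json

# Hypothetical one-line tweet as it might arrive on stdin: \u00e9 is
# a JSON Unicode escape inside the raw text, not a literal character.
line = '{"text": "caf\\u00e9", "id": 1}'

# json.loads decodes the \u00e9 escape into the actual accented
# character, so no raw backslash-u sequence survives in the value.
data = json.loads(line)
print(data["text"])
```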