Data from DynamoDB to EMR PySpark: object not serializable


I have been trying to load data from DynamoDB into an EMR Spark application, following the approach from this article.

I can successfully reproduce the article's Scala code (on a simple movie-ratings dataset). However, I want to use PySpark rather than Scala, so I ran:

$pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar

# Hadoop job configuration for the DynamoDB connector
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "myDynamoDBTable",
    "dynamodb.endpoint": "https://dynamodb.us-east-1.amazonaws.com",
    "dynamodb.regionid": "us-east-1",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    "mapred.output.format.class": "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat",
}

# Read the table as an RDD of (Text, DynamoDBItemWritable) pairs
orders = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
However, it raised the following exception:

16/08/09 00:34:18 ERROR TaskSetManager: Task 0.0 in stage 8.0 (TID 8) had a not serializable result: org.apache.hadoop.dynamodb.DynamoDBItemWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.dynamodb.DynamoDBItemWritable, value: {"item_id":{"n":"661"},"user_id":{"n":"1"},"info":{"m":{"time_stamp":{"s":"2016-08-08T16:46:14.920485"},"rating":{"n":"3"},"is_true":{"bOOL":true}}}})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (,{"item_id":{"n":"661"},"user_id":{"n":"1"},"info":{"m":{"time_stamp":{"s":"2016-08-08T16:46:14.920485"},"rating":{"n":"3"},"is_true":{"bOOL":true}}}}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1); not retrying
16/08/09 00:34:18 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/spark/python/pyspark/context.py", line 703, in hadoopRDD
jconf, batchSize)
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
.....
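Reading the serialization stack, it looks like the `_2` field of each `(Text, DynamoDBItemWritable)` pair, i.e. the item value itself, is what fails Java serialization when the first batch of results is shipped back to Python. I see that `sc.hadoopRDD` also accepts `keyConverter`/`valueConverter` arguments naming JVM-side classes, so I suspect something like the sketch below is needed; `com.example.DynamoDBItemToJsonConverter` here is a hypothetical class I would still have to write in Scala or Java and ship via `--jars`, since I have not found a built-in converter for DynamoDBItemWritable:

# Sketch only: keyConverter/valueConverter name JVM classes implementing
# org.apache.spark.api.python.Converter; the converter class below is
# hypothetical and does not ship with emr-ddb-hadoop.jar.
orders = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    valueConverter="com.example.DynamoDBItemToJsonConverter",  # hypothetical
    conf=conf,
)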
Can anyone help me understand this error? I have been puzzling over it for a while without a clue, and I would appreciate any pointers.
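In case it is useful context, the fallback I would otherwise use is to scan the table with plain boto3 and parallelize the items myself, which bypasses DynamoDBInputFormat entirely (and gives up the connector's parallel scan). A minimal sketch, assuming the EMR instance profile supplies the credentials:

import boto3

# Fallback sketch: fetch the items with boto3 and hand plain Python dicts
# to Spark, avoiding DynamoDBItemWritable altogether.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("myDynamoDBTable")

items = []
resp = table.scan()
items.extend(resp["Items"])
# scan() returns at most 1 MB per call, so page through the results
while "LastEvaluatedKey" in resp:
    resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
    items.extend(resp["Items"])

orders = sc.parallelize(items)  # `sc` is provided by the pyspark shell
print(orders.count())

I would still prefer to make the DynamoDBInputFormat route work, though.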