Task not serializable when trying to write records from a Spark DataFrame to DynamoDB via the Java SDK

Here is the code snippet:

val client = AmazonDynamoDBClientBuilder.standard
  .withRegion(Regions.the_region)
  .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("access_key", "secret_key")))
  .build()
val dynamoDB = new DynamoDB(client)
val table = dynamoDB.getTable("tbl_name")

def putItem(email: String, name: String): Unit = {
    val item = new Item().withPrimaryKey("email", email).withNumber("ts", System.currentTimeMillis).withString("name", name)
    table.putItem(item)
}

spark.sql("""
select
    email,
    name
from db.hive_table_name
""").rdd.repartition(40).map(row => putItem(row.getString(0), row.getString(1))).collect()
I intend to write each record to the DynamoDB table via the AWS Java SDK, but it complains with the following error:

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)

How can I adjust the code so that a DynamoDB Table object is created per partition, taking advantage of the parallelism of the Spark job? Thanks.

I would use foreachPartition instead of map followed by collect:

spark.sql(query).rdd.repartition(40).foreachPartition(iter => {

  // Build the client and table inside the partition so nothing is serialized from the driver.
  val client = AmazonDynamoDBClientBuilder.standard.withRegion(Regions.the_region)
    .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("access_key", "secret_key"))).build()
  val dynamoDB = new DynamoDB(client)
  val table = dynamoDB.getTable("tbl_name")

  // Define putItem here so it closes over the partition-local table, not the driver's non-serializable one.
  def putItem(email: String, name: String): Unit = {
    val item = new Item().withPrimaryKey("email", email)
      .withNumber("ts", System.currentTimeMillis)
      .withString("name", name)
    table.putItem(item)
  }

  iter.foreach(row => putItem(row.getString(0), row.getString(1)))
})

The DynamoDB client and Table objects are not serializable, which is why they have to be created inside foreachPartition on the executors rather than once on the driver.
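Another way to sidestep the serialization issue is to hold the connection in a top-level Scala object: referencing an object from a task does not serialize it, so each executor JVM builds the client lazily on first use. Below is a minimal sketch of that pattern, assuming the same AWS SDK v1 document API and reusing the placeholder region, credentials and table name from the question; the DynamoConnection object name is just illustrative:

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.regions.Regions
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.document.{DynamoDB, Item, Table}

// Holder object: it is initialized once per executor JVM and is never
// shipped with the task closure, so nothing needs to be serializable.
object DynamoConnection {
  lazy val client = AmazonDynamoDBClientBuilder.standard
    .withRegion(Regions.the_region)
    .withCredentials(new AWSStaticCredentialsProvider(
      new BasicAWSCredentials("access_key", "secret_key")))
    .build()

  lazy val table: Table = new DynamoDB(client).getTable("tbl_name")
}

spark.sql(query).rdd.repartition(40).foreachPartition(iter => {
  iter.foreach { row =>
    val item = new Item()
      .withPrimaryKey("email", row.getString(0))
      .withNumber("ts", System.currentTimeMillis)
      .withString("name", row.getString(1))
    DynamoConnection.table.putItem(item)
  }
})

The trade-off versus building the client inside foreachPartition is that the connection lives for the lifetime of the executor instead of being recreated for every partition.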