
Batching a Spark Scala Dataset


I am trying to create batches of rows of a Dataset in Spark. To control the number of records being sent to a service, I want to batch the items so that I can maintain the rate at which the data is sent.

For a given Dataset[Person] I want to create a Dataset[PersonBatch].

For example, if the input Dataset[Person] has 100 records, the output should be a Dataset[PersonBatch] where each PersonBatch is a list of n records (Person).

I have tried this, but it does not work:

object DataBatcher extends Logger {

  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500  //default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {

    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if(batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.requestBatch.clear()
      }
    })
    if(batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }  
}
I want to run this job on a Hadoop cluster.
Can someone help me?

rdd.iterator has a grouped function that may be useful for you.

For example:

iter.grouped(batchSize)
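
As a rough sketch of how grouped could be applied to the original question: the ds.foreach approach fails because the closure runs on the executors, so appending to the driver-side batchList there has no effect. Grouping inside mapPartitions keeps the batching on the executors and returns a proper Dataset[PersonBatch]. This assumes case classes roughly like the ones below (the field names are placeholders) and a SparkSession named spark:

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes standing in for the ones from the question.
case class Person(name: String, age: Int)
case class PersonBatch(personBatch: Seq[Person])

object DatasetBatcher {

  // Group each partition's iterator into fixed-size batches on the executors,
  // instead of mutating a driver-side ListBuffer from inside ds.foreach.
  def createBatches(ds: Dataset[Person], batchSize: Int)(implicit spark: SparkSession): Dataset[PersonBatch] = {
    import spark.implicits._
    ds.mapPartitions { iter =>
      iter.grouped(batchSize).map(group => PersonBatch(group))
    }
  }
}

Note that batches are formed per partition, so the last batch in each partition may contain fewer than batchSize records.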

Here is a sample snippet that uses iter.grouped(batchSize), with a batch size of 1000, to insert records into a database:

import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.DataFrame

// Repartition first: numPartitions ~ the number of simultaneous DB connections you plan to allow,
// e.g. df.repartition(numOfPartitionsYouWant), and pass the result in as dataFrame.
def insertToTable(sqlDatabaseConnectionString: String,
                  sqlTableName: String,
                  dataFrame: DataFrame): Unit = {

  val tableHeader: String = dataFrame.columns.mkString(",")
  dataFrame.foreachPartition { partition =>
    // NOTE: one connection per partition (a better approach is to use a connection pool)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used because some databases, e.g. Azure SQL, cannot handle batches larger than 1000.
    partition.grouped(1000).foreach { group =>
      val insertString: scala.collection.mutable.StringBuilder =
        new scala.collection.mutable.StringBuilder()

      group.foreach { record =>
        insertString.append("('" + record.mkString(",") + "'),")
      }

      sqlExecutorConnection
        .createStatement()
        .executeUpdate(f"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
          + insertString.stripSuffix(","))
    }

    sqlExecutorConnection.close() // close the connection so that connections are not exhausted
  }
}
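
A minimal usage sketch, assuming the DataFrame is repartitioned before being passed in (the connection string, table name, and partition count below are placeholders):

// Placeholder values; substitute your own connection string, table name, and partition count.
val connectionString = "jdbc:sqlserver://<host>:1433;databaseName=<db>;user=<user>;password=<pwd>"
val repartitioned = df.repartition(8) // ~ number of simultaneous DB connections

insertToTable(connectionString, "PersonTable", repartitioned)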