Batching a Spark Scala Dataset
scala, apache-spark, spark-dataframe, apache-spark-dataset

I am trying to create batches of the rows of a Dataset in Spark. To cap the number of records sent to a service, I want to batch the items so that the rate at which the data is sent can be controlled.

For a given Dataset[Person] I want to create a Dataset[PersonBatch]: for example, if the input Dataset[Person] has 100 records, the output Dataset[PersonBatch] should be such that each PersonBatch is a list of n records (Person).

I have tried this, but it does not work:
import scala.collection.mutable.ListBuffer

object DataBatcher extends Logger {
  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500 // default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {
    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if (batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.personBatch.clear()
      }
    })
    if (batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }
}
I want to run this job on a Hadoop cluster. Can someone help me?

The code above does not work because the closure passed to ds.foreach runs on the executors: each executor mutates its own copy of the DataBatcher object and of batch, so the batchList on the driver stays empty. Instead of collecting rows into shared mutable state, note that the iterator over each partition has a grouped function that may be useful for you.
For example:

iter.grouped(batchSize)
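Applied to the question's types, here is a minimal sketch of that approach (the shapes of Person and PersonBatch and the SparkSession parameter are assumptions for illustration; only the names come from the question):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)          // shape assumed for illustration
case class PersonBatch(personBatch: Seq[Person])   // Seq so Spark's built-in encoders apply

def createBatches(spark: SparkSession, ds: Dataset[Person], batchSize: Int): Dataset[PersonBatch] = {
  import spark.implicits._
  // grouped() consumes each partition's iterator lazily, so no rows are
  // collected to the driver and no shared mutable state is needed.
  ds.mapPartitions(_.grouped(batchSize).map(group => PersonBatch(group)))
}

Each output row is then one batch of at most batchSize Person records.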
Here is a sample snippet that uses iter.grouped(batchSize) (1000 here) to batch-insert rows into a database:
import java.sql.{Connection, DriverManager}

// numPartitions ~ the number of simultaneous DB connections you plan to allow
val dataFrame = df.repartition(numOfPartitionsYouWant)

def insertToTable(sqlDatabaseConnectionString: String,
                  sqlTableName: String): Unit = {
  val tableHeader: String = dataFrame.columns.mkString(",")
  dataFrame.foreachPartition { partition =>
    // NOTE: one connection per partition (a connection pool would be even better)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used because some databases (e.g. Azure SQL)
    // cannot handle batches of more than 1000 rows
    partition.grouped(1000).foreach { group =>
      val insertString: scala.collection.mutable.StringBuilder =
        new scala.collection.mutable.StringBuilder()
      group.foreach { record =>
        insertString.append("('" + record.mkString(",") + "'),")
      }
      sqlExecutorConnection
        .createStatement()
        .executeUpdate(s"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
          + insertString.stripSuffix(","))
    }
    sqlExecutorConnection.close() // close the connection so that connections won't be exhausted
  }
}
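One caveat about this design: building the INSERT statement by string concatenation breaks on values that contain quotes or commas and is open to SQL injection. Below is a sketch of the same per-partition, 1000-row batching using JDBC's PreparedStatement batch API instead (this variant is my suggestion, not part of the snippet above):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.Row

def insertPartition(connectionString: String, tableName: String,
                    columns: Seq[String], partition: Iterator[Row]): Unit = {
  val conn: Connection = DriverManager.getConnection(connectionString)
  val placeholders = columns.map(_ => "?").mkString(",")
  val stmt: PreparedStatement = conn.prepareStatement(
    s"INSERT INTO [$tableName] (${columns.mkString(",")}) VALUES ($placeholders)")
  try {
    partition.grouped(1000).foreach { group =>
      group.foreach { row =>
        columns.indices.foreach(i => stmt.setObject(i + 1, row.get(i).asInstanceOf[AnyRef]))
        stmt.addBatch()   // queue the row
      }
      stmt.executeBatch() // one round trip per batch of up to 1000 rows
    }
  } finally {
    stmt.close()
    conn.close() // avoid exhausting connections
  }
}

// Usage: compute the column list on the driver so the closure does not capture the DataFrame
val cols = dataFrame.columns.toSeq
dataFrame.foreachPartition { partition: Iterator[Row] =>
  insertPartition(sqlDatabaseConnectionString, sqlTableName, cols, partition)
}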