Apache Spark: how do I access data source options (e.g. Kafka)?
I set options on a Spark batch read from Kafka, but when I try to read one of them back as a configuration property, I get None. Why is that?
val df = sparkSession
.read
.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
.option("kafka.bootstrap.servers", "kafka.brokers".getConfigValue)
.option("subscribe", "kafka.devicelocationdatatopic".getConfigValue)
.option("startingOffsets", "kafka.startingOffsets".getConfigValue)
.option("endingOffsets", "kafka.endingOffsets".getConfigValue)
.option("failOnDataLoss", "false") // any failure regarding data loss in topic or else, not supposed to fail, it has to continue...
.option("maxOffsetsPerTrigger", "3")
.load()
println("maxOffsetsPerTrigger " + df.sparkSession.conf.getOption("maxOffsetsPerTrigger"))
Current output
None
Expected output
maxOffsetsPerTrigger 3
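The options are only available to the underlying data source. Spark SQL tries to hide the complexity of working with different data sources, and this is one of many implementation details.
df.sparkSession.conf.getOption("maxOffsetsPerTrigger")
This reads a Spark session property, which is not the same thing as the options you specify when describing a data source (e.g. Kafka).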
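In the example above, you are trying to access a Spark property named maxOffsetsPerTrigger. The Option part of getOption refers to the Scala type that is returned (Option[String]), not to the everyday meaning of the word "option".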
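To illustrate what the returned type means, here is a minimal sketch in plain Scala. The sessionConf map is a hypothetical stand-in for the session configuration, not Spark's own class; it only shows that a lookup for a key that was never set yields None, which is what the question observed.

```scala
// Hypothetical stand-in for the session configuration (an assumption for illustration).
val sessionConf: Map[String, String] = Map("spark.app.name" -> "demo")

// A key that was never set in the configuration gives None.
val missing: Option[String] = sessionConf.get("maxOffsetsPerTrigger")
// A key that is set gives Some(value).
val present: Option[String] = sessionConf.get("spark.app.name")

println(missing)                        // None
println(present)                        // Some(demo)
println(missing.getOrElse("<not set>")) // <not set>
```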
You can specify Spark properties with --conf on the command line. Note that only properties with the spark. prefix are allowed.
$ spark-shell \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.4 \
--conf spark.maxOffsetsPerTrigger=3
scala> spark.conf.getOption("spark.maxOffsetsPerTrigger")
res0: Option[String] = Some(3)
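As for the expected output: it is not available out of the box, so you have to get around some "private" fences.
The following code should do the trick. Use at your own risk.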
// spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.4
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "demo:9092")
.option("subscribe", "demo")
.option("maxOffsetsPerTrigger", "3")
.load
val plan = df.queryExecution.logical
scala> println(plan.numberedTreeString)
00 Relation[key#0,value#1,topic#2,partition#3,offset#4L,timestamp#5,timestampType#6] KafkaRelation(strategy=Subscribe[demo], start=EarliestOffsetRangeLimit, end=LatestOffsetRangeLimit)
// :paste -raw
// BEGIN
package org.apache.spark.sql.kafka010
object Util {
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
def bypassPrivateKafka010(plan: LogicalPlan) = {
import org.apache.spark.sql.execution.datasources.LogicalRelation
import org.apache.spark.sql.kafka010.KafkaRelation
plan.collect { case LogicalRelation(r: KafkaRelation, _, _, _) => r }
}
}
// END
import org.apache.spark.sql.kafka010.Util.bypassPrivateKafka010
import org.apache.spark.sql.kafka010.KafkaRelation
val kafkaRelation = bypassPrivateKafka010(plan).head
// sourceOptions is a private field of KafkaRelation
// :paste -raw
// BEGIN
package org.apache.spark.sql.kafka010
object Util2 {
import org.apache.spark.sql.kafka010.KafkaRelation
def bypassPrivate(r: KafkaRelation): Map[String, String] = {
val clazz = classOf[KafkaRelation]
val sourceOptions = clazz.getDeclaredField("sourceOptions")
sourceOptions.setAccessible(true)
sourceOptions.get(r).asInstanceOf[Map[String, String]]
}
}
// END
import org.apache.spark.sql.kafka010.Util2.bypassPrivate
val options = bypassPrivate(kafkaRelation)
scala> options.foreach(println)
(maxoffsetspertrigger,3)
(subscribe,demo)
(kafka.bootstrap.servers,demo:9092)
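If you control the code that builds the reader, a less invasive alternative may be to keep the options in a Map of your own and read them back from there, with no reflection involved. The sketch below is plain Scala; the kafkaOptions name and the demo values are assumptions taken from the example above. DataFrameReader.options accepts such a Map, as shown in the comment.

```scala
// Keep the data source options in one place so they can be read back later.
// The values are the demo values used above, not real brokers or topics.
val kafkaOptions: Map[String, String] = Map(
  "kafka.bootstrap.servers" -> "demo:9092",
  "subscribe" -> "demo",
  "maxOffsetsPerTrigger" -> "3"
)

// In spark-shell (with the Kafka package on the classpath) the map would be passed as:
//   val df = spark.read.format("kafka").options(kafkaOptions).load

// Read the option back from the map instead of from the Spark session configuration.
println("maxOffsetsPerTrigger " + kafkaOptions.getOrElse("maxOffsetsPerTrigger", "<not set>"))
```

Unlike the reflection approach, this only shows what your own code passed in, not what the data source actually stored.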
Hi Jacek, this worked as a temporary workaround, so I am accepting this answer. Thanks for your help.