Scala Spark Streaming empty RDD problem
I am trying to create a custom streaming receiver that reads from an RDBMS:
val dataDStream = ssc.receiverStream(new inputReceiver())
dataDStream.foreachRDD((rdd: RDD[String], time: Time) => {
  val newdata = rdd.flatMap(x => x.split(","))
  newdata.foreach(println)  // *******This line has problem, newdata has no records
})
ssc.start()
ssc.awaitTermination()
}
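The flatMap itself is not the problem: if the receiver never calls store(), every micro-batch RDD is empty and foreachRDD has nothing to print. A minimal plain-Scala sketch (no Spark; the SplitDemo object is hypothetical) of what rdd.flatMap(x => x.split(",")) computes on a batch of records:

```scala
object SplitDemo {
  // Mirrors the DStream transformation: each record is split on commas
  // and the fragments are flattened into a single collection.
  def splitRecords(records: Seq[String]): Seq[String] =
    records.flatMap(_.split(","))

  def main(args: Array[String]): Unit =
    // prints: 1 2 3 4 5
    println(splitRecords(Seq("1,2,3", "4,5")).mkString(" "))
}
```

With a non-empty batch this produces the individual fields, so an empty printout means the batch itself contained no records.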
class inputReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("RDBMS data Receiver") {
      override def run() {
        receive()
      }
    }.start()
  }

  def onStop() {
  }

  def receive() {
    val sqlcontext = SQLContextSingleton.getInstance()
    // **** I am assuming something wrong in following code
    val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
    for (data <- rdd) {
      store(data.toString())
    }
    logInfo("Stopped receiving")
    restart("Trying to connect again")
  }
}
To get the code working, the following should change:
def receive() {
  val sqlcontext = SQLContextSingleton.getInstance()
  val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
  // **** this: collect the DataFrame's rows on the driver and push each into the stream
  DF.rdd.collect.foreach(data => store(data.toString()))
  logInfo("Stopped receiving")
  restart("Trying to connect again")
}
But this is not advisable, because all of the data in the JSON file will then be processed by the driver, and the receiver makes no proper provision for reliability.
I suspect Spark Streaming is not a good fit for your use case. Reading between the lines, either you really are streaming, in which case you need a proper producer, or you are dumping data from the RDBMS into JSON, in which case you don't need Spark Streaming at all.
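If the JSON file really is a one-off dump from the RDBMS, a plain batch job can read it without any streaming machinery. A minimal sketch, assuming a Spark 1.x-style SQLContext to match the code above (the BatchRead object name and app name are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-json-read"))
    val sqlcontext = new SQLContext(sc)
    // Read the dump once as a DataFrame; no receiver, no store(),
    // and no micro-batches to come up empty.
    val df = sqlcontext.read.json("/home/cloudera/data/s.json")
    df.show()
    sc.stop()
  }
}
```

Submitted once with spark-submit, this processes the file as an ordinary batch job, which is all the receiver above was effectively doing.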