
Apache Spark: Spark Streaming conditionally saving to HBase

Tags: apache-spark, hbase, apache-kafka, spark-streaming

What I am trying to do is integrate Kafka with Spark Streaming and check word counts. If a word's count reaches a certain threshold, say 5, I save it into HBase as (word, count). The problem is that once the threshold is reached, the data is re-sent to HBase over and over at the window slide interval (2 seconds).

I store the values as a PairRDD of (String, (Long, String)), that is, (word, (count, send/sent/nop)). When a word's count crosses the threshold, x._2._2 is changed to "send"; another map operation then saves the "send"-flagged words to HBase and flips the flag to "sent".

My code is as follows:

def reduceFunc(x: (Long, String), y: (Long, String)): (Long, String) = {
    // Mark "send" only when the merged count first crosses the threshold
    var op = "nop"
    val count = x._1 + y._1
    if (x._1 <= 5 && count > 5) {
      op = "send"
    }
    (count, op)
  }
.
.
.
var wordCounts = words.map(x => (x, (1L, "nop")))
      .reduceByKeyAndWindow(reduceFunc, reduceFunc_inv, Minutes(10), Seconds(2), 2)
//Filter only send states
var toAlert = wordCounts.filter(x => x._2._2 == "send")
// For each word to send, save into HBase
toAlert.foreachRDD{ rdd => 
              val now = Calendar.getInstance.getTimeInMillis
              val hbaseConf = HBaseConfiguration.create()
              hbaseConf.set("hbase.zookeeper.quorum", quorum)
              HBaseAdmin.checkHBaseAvailable(hbaseConf)
              val tableName = "test"
              val table = new HTable(hbaseConf, tableName)

              val resultRdd = rdd.map { tuple =>  (tuple._1.asInstanceOf[String].getBytes(), 
                        scala.collection.immutable.Map( columnFamily -> 
                                Array( ( columnName, (tuple._2._1.toString, now) ) ) 
                        ) 
                  )
              }
              sendToHBase(resultRdd, hbaseConf, tableName)
}
//Save operation is done. Therefore, change the send states to sent.
wordCounts = wordCounts.map{ x => 
    if (x._2._2 == "send")
      (x._1, (x._2._1,"sent"))
    else
      x
 }
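
For completeness, reduceFunc_inv and sendToHBase are used above but not shown. They look roughly like the following (simplified sketches, not the exact code; the op handling in reduceFunc_inv and the per-partition writes in sendToHBase are approximations):

// Inverse reduce for reduceByKeyAndWindow: subtracts the counts of the batch
// leaving the window. The op handling is approximate, since the send/sent
// state is not cleanly invertible.
def reduceFunc_inv(x: (Long, String), y: (Long, String)): (Long, String) = {
    val count = x._1 - y._1
    val op = if (count <= 5) "nop" else x._2
    (count, op)
  }

// Writes each record as a Put. A Hadoop Configuration is not serializable,
// so it is rebuilt on the executors from the ZooKeeper quorum string.
def sendToHBase(rdd: RDD[(Array[Byte], Map[String, Array[(String, (String, Long))]])],
                conf: Configuration, tableName: String): Unit = {
    val quorum = conf.get("hbase.zookeeper.quorum")
    rdd.foreachPartition { partition =>
      val localConf = HBaseConfiguration.create()
      localConf.set("hbase.zookeeper.quorum", quorum)
      val table = new HTable(localConf, tableName)
      partition.foreach { case (rowKey, families) =>
        val put = new Put(rowKey)
        for ((family, columns) <- families; (qualifier, (value, ts)) <- columns)
          put.addColumn(family.getBytes, qualifier.getBytes, ts, value.getBytes)
        table.put(put)
      }
      table.close()
    }
  }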
It works and saves words that exceed the threshold into HBase. However, once a word's count reaches 6, it updates the HBase table again every 2 seconds. Once the count reaches 7, it stops re-sending.

I don't know what I am missing. I also tried updateStateByKey, but found it very slow and it seems to drop some input: if I send too many words too fast, it counts them as one. Maybe there is another way?
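
For reference, the updateStateByKey variant I tried was shaped roughly like this (a simplified sketch, not the exact code; it also needs a checkpoint directory set on the StreamingContext):

// State is (count, op); "send" is marked only on the batch where the count
// first crosses the threshold, then flipped to "sent" on the next update.
def updateFunc(newValues: Seq[Long], state: Option[(Long, String)]): Option[(Long, String)] = {
    val (oldCount, oldOp) = state.getOrElse((0L, "nop"))
    val count = oldCount + newValues.sum
    val op =
      if (oldOp == "nop" && count > 5) "send"
      else if (oldOp == "send") "sent"
      else oldOp
    Some((count, op))
  }

val stateCounts = words.map(x => (x, 1L)).updateStateByKey(updateFunc)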

Thanks in advance.


In case it matters, my setup is Cloudera CDH 5.9.0 with 12 nodes. The Spark version is 1.6.0 and HBase is 1.2.0.

Really? Nobody cares in the Hadoop and Spark world?