Apache Spark: Spark Streaming conditionally saving to HBase
What I want to do is integrate Kafka with Spark Streaming and check word counts. If a word's count reaches a certain threshold, say 5, save it to HBase as (word, count). The problem is that once the threshold is reached, the data is re-sent to HBase again and again, once per slide interval (2 seconds).

I keep the state as a PairRDD of the form (String, (Long, String)), i.e. (word, (count, send/sent/nop)). When a word's count crosses the threshold, x._2._2 is changed to "send", and a later map operation saves the words flagged "send" to HBase and flips the flag to "sent".

My code is as follows:
def reduceFunc(x: (Long, String), y: (Long, String)): (Long, String) = {
  val count = x._1 + y._1
  var op = "nop"
  // Flag "send" only when this reduction step crosses the threshold
  if (x._1 <= 5 && count > 5) {
    op = "send"
  }
  (count, op)
}
.
.
.
var wordCounts = words.map(x => (x, (1L, "nop")))
  .reduceByKeyAndWindow(reduceFunc, reduceFunc_inv, Minutes(10), Seconds(2), 2)

// Filter only send states
var toAlert = wordCounts.filter(x => x._2._2 == "send")

// For each word to send, save into HBase
toAlert.foreachRDD { rdd =>
  val now = Calendar.getInstance.getTimeInMillis
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", quorum)
  HBaseAdmin.checkHBaseAvailable(hbaseConf)
  val tableName = "test"
  val table = new HTable(hbaseConf, tableName)
  val resultRdd = rdd.map { tuple =>
    (tuple._1.asInstanceOf[String].getBytes(),
     scala.collection.immutable.Map(columnFamily ->
       Array((columnName, (tuple._2._1.toString, now)))))
  }
  sendToHBase(resultRdd, hbaseConf, tableName)
}

// Save operation is done. Therefore, change the send states to sent.
wordCounts = wordCounts.map { x =>
  if (x._2._2 == "send")
    (x._1, (x._2._1, "sent"))
  else
    x
}
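To see when the "send" flag actually fires, the reduce logic can be traced as a pure function outside of Spark. This is a minimal standalone sketch (the `ReduceFuncTrace` object name is mine, the function body is copied from the code above): in a simple left-to-right fold the flag fires exactly once, at the crossing batch. Inside `reduceByKeyAndWindow` with an inverse function, however, the pair is recomputed on every slide from the previous window result, and the final map that flips "send" to "sent" produces a new DStream rather than feeding back into the windowed state, which is consistent with the re-sending you observe.

```scala
// Standalone trace of reduceFunc (no Spark needed): simulate one word arriving
// once per batch and watch when the "send" flag is produced.
object ReduceFuncTrace {
  def reduceFunc(x: (Long, String), y: (Long, String)): (Long, String) = {
    val count = x._1 + y._1
    val op = if (x._1 <= 5 && count > 5) "send" else "nop"
    (count, op)
  }

  def main(args: Array[String]): Unit = {
    var state = (0L, "nop")
    for (batch <- 1 to 8) {
      // Each batch contributes a fresh (1L, "nop") element, as in words.map above
      state = reduceFunc(state, (1L, "nop"))
      println(s"batch $batch -> count=${state._1}, op=${state._2}")
    }
    // In this linear fold, only batch 6 prints op=send; 7 and 8 print op=nop.
  }
}
```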
It works and saves words that exceed the threshold to HBase. However, when a word's count reaches 6, it keeps updating the HBase table every 2 seconds. Once the count reaches 7, it stops re-sending.

I don't know what I'm missing. I also tried updateStateByKey and found it very slow, and it may drop some input: if I send too many words too fast, it counts them as one. Maybe there is another way.

Thanks in advance.
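For comparison, the updateStateByKey approach I tried can be written so that the flag transition lives inside the per-key state itself, so a word is emitted at most once. This is an illustrative sketch only, not the exact code I ran; `words`, `THRESHOLD`, and `updateFunc` are names I am assuming here, and updateStateByKey additionally requires checkpointing to be enabled:

```scala
// Hypothetical sketch: keep (count, flag) per word inside Spark's state so the
// crossing to "send" happens exactly once. Requires ssc.checkpoint(...).
val THRESHOLD = 5L

def updateFunc(newOnes: Seq[Long], state: Option[(Long, String)]): Option[(Long, String)] = {
  val (oldCount, oldFlag) = state.getOrElse((0L, "nop"))
  val count = oldCount + newOnes.sum
  val flag =
    if (oldFlag == "send" || oldFlag == "sent") "sent"  // already emitted: never re-send
    else if (count > THRESHOLD) "send"                  // first crossing: emit once
    else "nop"
  Some((count, flag))
}

val wordCounts = words.map(x => (x, 1L)).updateStateByKey(updateFunc)
val toAlert = wordCounts.filter(_._2._2 == "send")  // fires only on the crossing batch
```

Because the "sent" flag is stored in the state rather than produced by a downstream map, the next batch sees it and never flags "send" again, which avoids the repeated HBase writes.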
In case you need it, my setup is Cloudera CDH 5.9.0 with 12 nodes. The Spark version is 1.6.0 and the HBase version is 1.2.0.

Really? Does nobody in the Hadoop and Spark world care?