Apache Spark: Spark Streaming conditionally saving to HBase
What I want to do is integrate Kafka with Spark Streaming and check word counts. If a word's count reaches a certain threshold, say 5, save it to HBase as (word, count). The problem is that once the threshold is reached, the data is re-sent to HBase again and again, once per slide interval (2 seconds).

I keep the state as a PairRDD of the form (String, (Long, String)), i.e. (word, (count, send/sent/nop)). When a word's count crosses the threshold, x._2._2 is changed to "send", and a later map operation saves the words flagged "send" to HBase and flips the flag to "sent".

My code is as follows:
def reduceFunc(x: (Long, String), y: (Long, String)): (Long, String) = {
  val count = x._1 + y._1
  var op = "nop"
  // Flag "send" only when this reduction step crosses the threshold
  if (x._1 <= 5 && count > 5) {
    op = "send"
  }
  (count, op)
}
.
.
.
var wordCounts = words.map(x => (x, (1L, "nop")))
  .reduceByKeyAndWindow(reduceFunc, reduceFunc_inv, Minutes(10), Seconds(2), 2)

// Filter only send states
var toAlert = wordCounts.filter(x => x._2._2 == "send")

// For each word to send, save into HBase
toAlert.foreachRDD { rdd =>
  val now = Calendar.getInstance.getTimeInMillis
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", quorum)
  HBaseAdmin.checkHBaseAvailable(hbaseConf)
  val tableName = "test"
  val table = new HTable(hbaseConf, tableName)
  val resultRdd = rdd.map { tuple =>
    (tuple._1.asInstanceOf[String].getBytes(),
     scala.collection.immutable.Map(columnFamily ->
       Array((columnName, (tuple._2._1.toString, now)))))
  }
  sendToHBase(resultRdd, hbaseConf, tableName)
}

// Save operation is done. Therefore, change the send states to sent.
wordCounts = wordCounts.map { x =>
  if (x._2._2 == "send")
    (x._1, (x._2._1, "sent"))
  else
    x
}
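To see when the "send" flag actually fires, the reduce logic can be traced as a pure function outside of Spark. This is a minimal standalone sketch (the `ReduceFuncTrace` object name is mine, the function body is copied from the code above): in a simple left-to-right fold the flag fires exactly once, at the crossing batch. Inside `reduceByKeyAndWindow` with an inverse function, however, the pair is recomputed on every slide from the previous window result, and the final map that flips "send" to "sent" produces a new DStream rather than feeding back into the windowed state, which is consistent with the re-sending you observe.

```scala
// Standalone trace of reduceFunc (no Spark needed): simulate one word arriving
// once per batch and watch when the "send" flag is produced.
object ReduceFuncTrace {
  def reduceFunc(x: (Long, String), y: (Long, String)): (Long, String) = {
    val count = x._1 + y._1
    val op = if (x._1 <= 5 && count > 5) "send" else "nop"
    (count, op)
  }

  def main(args: Array[String]): Unit = {
    var state = (0L, "nop")
    for (batch <- 1 to 8) {
      // Each batch contributes a fresh (1L, "nop") element, as in words.map above
      state = reduceFunc(state, (1L, "nop"))
      println(s"batch $batch -> count=${state._1}, op=${state._2}")
    }
    // In this linear fold, only batch 6 prints op=send; 7 and 8 print op=nop.
  }
}
```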
It works and saves words that exceed the threshold to HBase. However, when a word's count reaches 6, it keeps updating the HBase table every 2 seconds. Once the count reaches 7, it stops re-sending.

I don't know what I'm missing. I also tried updateStateByKey and found it very slow, and it may drop some input: if I send too many words too fast, it counts them as one. Maybe there is another way.

Thanks in advance.
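For comparison, the updateStateByKey approach I tried can be written so that the flag transition lives inside the per-key state itself, so a word is emitted at most once. This is an illustrative sketch only, not the exact code I ran; `words`, `THRESHOLD`, and `updateFunc` are names I am assuming here, and updateStateByKey additionally requires checkpointing to be enabled:

```scala
// Hypothetical sketch: keep (count, flag) per word inside Spark's state so the
// crossing to "send" happens exactly once. Requires ssc.checkpoint(...).
val THRESHOLD = 5L

def updateFunc(newOnes: Seq[Long], state: Option[(Long, String)]): Option[(Long, String)] = {
  val (oldCount, oldFlag) = state.getOrElse((0L, "nop"))
  val count = oldCount + newOnes.sum
  val flag =
    if (oldFlag == "send" || oldFlag == "sent") "sent"  // already emitted: never re-send
    else if (count > THRESHOLD) "send"                  // first crossing: emit once
    else "nop"
  Some((count, flag))
}

val wordCounts = words.map(x => (x, 1L)).updateStateByKey(updateFunc)
val toAlert = wordCounts.filter(_._2._2 == "send")  // fires only on the crossing batch
```

Because the "sent" flag is stored in the state rather than produced by a downstream map, the next batch sees it and never flags "send" again, which avoids the repeated HBase writes.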
In case you need it, my setup is Cloudera CDH 5.9.0 with 12 nodes. The Spark version is 1.6.0 and the HBase version is 1.2.0.

Really? Does nobody in the Hadoop and Spark world care?