Filtering a data stream in Storm
I have a simple Storm topology that reads data from Kafka, then parses the messages and extracts their fields. I want to filter the tuple stream by one field's value and perform a count aggregation on another field's value. How can I do that in Storm? I haven't found corresponding methods on tuples (filter, aggregate), so should I apply those functions to the field values directly? Here is the topology:
topologyBuilder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1)
topologyBuilder.setBolt("parser_bolt", new ParserBolt()).shuffleGrouping("kafka_spout")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("parser_bolt")
val config = new Config()
cluster.submitTopology("kafkaTest", config, topologyBuilder.createTopology())
I have set up KafkaTwitterBolt to do the counting and filtering on the parsed fields. So far I have only managed to filter the entire list of values, not a specific field:
class KafkaTwitterBolt() extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val tweetValues = input.getValues.asScala.toList
    val filterTweets = tweetValues
      .map(_.toString)
      .filter(_ contains "big data")
    val resultAllValues = new Values(filterTweets)
    collector.emit(resultAllValues)
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("created_at", "id", "text", "source", "timestamp_ms",
      "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
      "user.friends_count", "user.lang", "user.favorite_count", "entities.hashtags"))
  }
}
It turned out that the Storm core API does not offer this directly; to filter on an arbitrary field you can use Trident, which has built-in filtering. The code then looks like this:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout",
    new KafkaTridentSpoutOpaque(spoutConfig))
  .map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
    "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
    "user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
  .filter(new LanguageFilter)
The filter function itself:
class LanguageFilter extends BaseFilter {
  override def isKeep(tuple: TridentTuple): Boolean = {
    val language = tuple.getStringByField("user.lang")
    println(s"TWEET: $language")
    language.contains("en")
  }
}
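The count aggregation from the question can be expressed in Trident as well, since it ships with built-in aggregators. A hedged sketch continuing from the filtered `stream` above; the grouping field is an assumption from the declared schema, and `MemoryMapState` is Trident's in-memory state (swap in a durable `StateFactory` for production):

```scala
// Sketch only: group the kept tuples by one field and count per group.
// Count and MemoryMapState.Factory are Trident built-ins
// (org.apache.storm.trident.operation.builtin / .testing packages).
stream
  .groupBy(new Fields("user.lang"))
  .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
```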
That earlier answer is not quite right: the Storm core API does allow filtering and aggregation, you just have to write the logic yourself.
A filtering bolt is simply a bolt that discards some tuples and passes others on. For example, the following bolt filters tuples based on a string field:
class FilteringBolt() extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val values = input.getValues // java.util.List[AnyRef], can be emitted as-is
    if ("Pass me" == values.get(0)) {
      collector.emit(values)
    }
    // Emitting nothing means discarding the tuple
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("some-field"))
  }
}
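Wired into the topology from the question, such a bolt would sit between the parser and the transformer. A sketch reusing the question's builder (the `filter_bolt` id is made up here):

```scala
// Hypothetical wiring: non-matching tuples are dropped before the transformer.
topologyBuilder.setBolt("filter_bolt", new FilteringBolt()).shuffleGrouping("parser_bolt")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("filter_bolt")
```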
An aggregating bolt is simply a bolt that collects several tuples and then emits a new aggregate tuple anchored to the original tuples:
import scala.collection.JavaConverters._

class AggregatingBolt extends BaseRichBolt {
  private var collector: OutputCollector = _
  private val tuplesToAggregate = scala.collection.mutable.ListBuffer[Tuple]()

  override def prepare(stormConf: java.util.Map[_, _], context: TopologyContext,
                       collector: OutputCollector): Unit = {
    this.collector = collector
  }

  override def execute(input: Tuple): Unit = {
    tuplesToAggregate += input
    if (tuplesToAggregate.size == 10) {
      val aggregateTuple: Values = ??? // create a new set of values based on tuplesToAggregate
      // This anchors the new aggregate tuple to all the original tuples,
      // so if the aggregate fails, the original tuples are replayed.
      collector.emit(tuplesToAggregate.asJava, aggregateTuple)
      // Ack the original tuples now that this bolt is done with them.
      // Note that you MUST emit before you ack, or the at-least-once guarantee will be broken.
      tuplesToAggregate.foreach(collector.ack)
      tuplesToAggregate.clear()
    }
    // Note that we don't ack the input tuples until the aggregate has been emitted.
    // This lets all the aggregated tuples be replayed in case the aggregate fails.
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("aggregate"))
  }
}
Note that for aggregation you need to extend BaseRichBolt and do the acking manually, because you want to delay acking each tuple until it has been included in an aggregate tuple.
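The batch-then-ack pattern above can be illustrated without any Storm dependency. A hypothetical standalone sketch in plain Scala, with `Int`s standing in for tuples and lists standing in for the collector, to show the emit-before-ack ordering:

```scala
import scala.collection.mutable.ListBuffer

// Standalone sketch of the aggregating bolt's batching logic:
// buffer inputs, emit one aggregate per full batch, and only then
// "ack" the originals. Emitting before acking is what preserves the
// at-least-once guarantee: a crash between the two replays the batch.
class BatchingAggregator(batchSize: Int) {
  private val buffer = ListBuffer[Int]()
  val emitted = ListBuffer[Int]() // stands in for collector.emit
  val acked = ListBuffer[Int]()   // stands in for collector.ack

  def execute(input: Int): Unit = {
    buffer += input
    if (buffer.size == batchSize) {
      emitted += buffer.sum // the aggregate "tuple", built from the batch
      acked ++= buffer      // ack the originals only after the emit
      buffer.clear()
    }
  }
}
```

With `batchSize = 3`, feeding the inputs 1 through 7 emits the aggregates 6 and 15 and acks 1 through 6; the seventh input stays buffered and unacked, exactly like a partial batch in the real bolt.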