Reducing events to time intervals

events logging mapreduce


Scenario: I have a service that logs events, as in the following CSV example:

#TimeStamp, Name, ColorOfPullover
TimeStamp01, Peter, Green
TimeStamp02, Bob, Blue
TimeStamp03, Peter, Green
TimeStamp04, Peter, Red
TimeStamp05, Peter, Green

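Events such as "Peter wears green" often occur several times in a row.

I have two goals:

Keep the data as small as possible

Keep all relevant data

Relevant means: I need to know during which time spans a person was wearing which color. For example: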

#StartTime, EndTime, Name, ColorOfPullover
TimeStamp01, TimeStamp03, Peter, Green
TimeStamp02, TimeStamp02, Bob, Blue
TimeStamp04, TimeStamp04, Peter, Red
TimeStamp05, TimeStamp05, Peter, Green

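In this form, I can answer questions like: what color was Peter wearing at TimeStamp02? (I can safely assume that each person wears the same color between two logged events of the same color.)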
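The lookup described above could be sketched as follows. This is only an illustration: the function name `color_at`, the tuple layout `(start, end, name, color)`, and the sample intervals are assumptions, and the timestamps are assumed to compare in chronological order (as the `TimeStampNN` strings do).

```python
def color_at(intervals, name, timestamp):
    """Return the color `name` wore at `timestamp`, or None if no interval covers it."""
    for start, end, who, color in intervals:
        # An interval answers the question if it belongs to the person
        # and the requested time lies inside [start, end].
        if who == name and start <= timestamp <= end:
            return color
    return None


# Assumed sample data in (start, end, name, color) form.
intervals = [
    ("TimeStamp01", "TimeStamp03", "Peter", "Green"),
    ("TimeStamp02", "TimeStamp02", "Bob", "Blue"),
    ("TimeStamp04", "TimeStamp04", "Peter", "Red"),
    ("TimeStamp05", "TimeStamp05", "Peter", "Green"),
]

print(color_at(intervals, "Peter", "TimeStamp02"))  # prints "Green"
```

A linear scan is enough for a sketch; a real store would more likely index the intervals by name and start time.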

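Main question: can I use an existing technology to achieve this? That is, can I feed it a continuous stream of events and have it extract and store the relevant data?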

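To be precise, I need to implement an algorithm like the following (pseudocode). The OnNewEvent method is called for each line of the CSV example, and its event parameter already contains the data from that line as member variables.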

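def OnNewEvent(event)
    # Latest stored interval for this person (null if none exists yet)
    entry = Database.getLatestEntryFor(event.personName)
    if (entry != null and entry.pulloverColor == event.pulloverColor)
        # Same color as before: extend the current interval
        entry.setIntervalEndDate(event.date)
        Database.store(entry)
    else
        # First event for this person, or the color changed: start a new interval
        newEntry = new Entry
        newEntry.setIntervalStartDate(event.date)
        newEntry.setIntervalEndDate(event.date)
        newEntry.setPulloverColor(event.pulloverColor)
        newEntry.setName(event.personName)
        Database.createNewEntry(newEntry)
    end
end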
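For readers who want to try the pseudocode, here is a minimal runnable sketch in Python. The `IntervalStore` class is a hypothetical in-memory stand-in for `Database`, and the field names are assumptions; it is not tied to any particular storage technology.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    start: str
    end: str
    name: str
    color: str


class IntervalStore:
    """Hypothetical in-memory stand-in for the Database in the pseudocode."""

    def __init__(self):
        self.entries = []   # all intervals, in creation order
        self._latest = {}   # name -> most recent Entry for that person

    def on_new_event(self, timestamp, name, color):
        entry = self._latest.get(name)
        if entry is not None and entry.color == color:
            # Same color as the latest interval: just move its end date.
            entry.end = timestamp
        else:
            # First event for this person, or a color change: open a new interval.
            entry = Entry(timestamp, timestamp, name, color)
            self.entries.append(entry)
            self._latest[name] = entry


store = IntervalStore()
events = [
    ("TimeStamp01", "Peter", "Green"),
    ("TimeStamp02", "Bob", "Blue"),
    ("TimeStamp03", "Peter", "Green"),
    ("TimeStamp04", "Peter", "Red"),
    ("TimeStamp05", "Peter", "Green"),
]
for event in events:
    store.on_new_event(*event)

for entry in store.entries:
    print(entry.start, entry.end, entry.name, entry.color)
```

The sketch assumes events arrive in timestamp order, as they do in the CSV sample.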

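One approach is to use HiveMQ. HiveMQ is an MQTT-based message broker. The nice thing about it is that you can write custom plugins to process incoming messages. To get the latest entry for a person, a hash table inside the HiveMQ plugin works fine. If the number of distinct persons is very large, I would consider using a cache such as Redis to hold the latest event for each person. Your service publishes the events to HiveMQ. The HiveMQ plugin processes the incoming events and applies the updates to your database.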
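The pattern described above (a hash table holding each person's latest entry, plus a database that receives the updates) might look roughly like this. The sketch is in Python and deliberately does not use the HiveMQ plugin API; `handle_message`, the `intervals` table, and the use of an in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE intervals ("
    "id INTEGER PRIMARY KEY, start_time TEXT, end_time TEXT, name TEXT, color TEXT)"
)

# Hash table: name -> (row id, color) of that person's latest interval.
latest = {}


def handle_message(timestamp, name, color):
    """Hypothetical handler, called once per incoming event message."""
    cached = latest.get(name)
    if cached is not None and cached[1] == color:
        # Same color: extend the cached interval without reading the database.
        conn.execute(
            "UPDATE intervals SET end_time = ? WHERE id = ?", (timestamp, cached[0])
        )
    else:
        # New person or color change: insert a new interval and cache it.
        cur = conn.execute(
            "INSERT INTO intervals (start_time, end_time, name, color) "
            "VALUES (?, ?, ?, ?)",
            (timestamp, timestamp, name, color),
        )
        latest[name] = (cur.lastrowid, color)
    conn.commit()


for message in [
    ("TimeStamp01", "Peter", "Green"),
    ("TimeStamp02", "Bob", "Blue"),
    ("TimeStamp03", "Peter", "Green"),
    ("TimeStamp04", "Peter", "Red"),
    ("TimeStamp05", "Peter", "Green"),
]:
    handle_message(*message)

rows = conn.execute(
    "SELECT start_time, end_time, name, color FROM intervals ORDER BY id"
).fetchall()
print(rows)
```

Because the hash table lives in process memory, it would have to be rebuilt from the database after a restart; an external cache such as Redis avoids that at the cost of a network round trip per event.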

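It should be possible to do this with logstash, but the problem is that you would have to run an elasticsearch request for every line to retrieve the latest entry, which would make the process very slow. That is why I don't think logstash is the right tool for this.

What is your data volume? How quickly do you need to react when a new event occurs? Is it acceptable if some events are lost?

The reaction to events can be slow. For example, a delay of one day is acceptable, so a cron job that runs once a day might be an option. Events must not be lost; that is critical.

Both links are broken.

Sorry mate, hackers were having fun; I just fixed the website. Feel free to browse the examples. Let me know if you need more clarification.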
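Since a delay of one day is acceptable, the once-a-day batch option could be sketched as below. This is a rough illustration only: `read_events`, `reduce_events`, and the `SAMPLE` text are assumed names, and the timestamps are assumed to sort chronologically.

```python
import csv
import io
from itertools import groupby

SAMPLE = """#TimeStamp, Name, ColorOfPullover
TimeStamp01, Peter, Green
TimeStamp02, Bob, Blue
TimeStamp03, Peter, Green
TimeStamp04, Peter, Red
TimeStamp05, Peter, Green
"""


def read_events(text):
    """Parse the CSV log into (timestamp, name, color) tuples, skipping the # header."""
    lines = (ln for ln in io.StringIO(text) if ln.strip() and not ln.startswith("#"))
    return [tuple(row) for row in csv.reader(lines, skipinitialspace=True)]


def reduce_events(rows):
    """Collapse each person's consecutive same-color events into one interval."""
    # Order by person, then by time, so each person's events are contiguous.
    rows = sorted(rows, key=lambda r: (r[1], r[0]))
    intervals = []
    # groupby only merges adjacent rows, so a color that returns later
    # (Green, Red, Green) correctly yields separate intervals.
    for (name, color), run in groupby(rows, key=lambda r: (r[1], r[2])):
        run = list(run)
        intervals.append((run[0][0], run[-1][0], name, color))
    return sorted(intervals)  # back to chronological order by start time


for interval in reduce_events(read_events(SAMPLE)):
    print(", ".join(interval))
```

A real daily job would also need to merge the first interval of each person with the last interval stored on the previous day, which this sketch leaves out.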

This is a typical scenario for any streaming architecture.

There are multiple existing technologies that work in tandem to get what you want.


1.  NoSQL database (HBase, Aerospike, Cassandra)
2.  Streaming jobs such as Spark Streaming (micro-batch) or Storm
3.  Run MapReduce in micro-batches to insert into the NoSQL database
4.  Kafka distributed queue

The end-to-end flow:

Data -> streaming framework -> NoSQL database
OR
Data -> Kafka -> streaming framework -> NoSQL database


In a NoSQL database there are two ways to model your data.
1. Key by "Name" and, for every event for that key, insert into the database.
   When fetching, you get back all events corresponding to that key.

2. Key by "Name" and, every time an event for a key arrives, do an UPSERT into an existing blob (an object saved as binary). Inside the blob you maintain the time ranges and the colors seen.
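The two models above can be compared with a small sketch. Plain Python dictionaries stand in for the key-value store here, and the blob is serialized as JSON bytes; `append_event`, `upsert_event`, and the blob layout are assumptions for illustration, not the API of HBase, Aerospike, or Cassandra.

```python
import json

event_store = {}  # model 1: name -> list of every (timestamp, color) event
blob_store = {}   # model 2: name -> binary blob holding the reduced intervals


def append_event(store, timestamp, name, color):
    """Model 1: keep every event; intervals must be computed at read time."""
    store.setdefault(name, []).append((timestamp, color))


def upsert_event(store, timestamp, name, color):
    """Model 2: read the blob, extend or add an interval, write the blob back."""
    raw = store.get(name)
    intervals = json.loads(raw) if raw is not None else []
    if intervals and intervals[-1]["color"] == color:
        intervals[-1]["end"] = timestamp  # same color: extend the last range
    else:
        intervals.append({"start": timestamp, "end": timestamp, "color": color})
    store[name] = json.dumps(intervals).encode("utf-8")


for ts, name, color in [
    ("TimeStamp01", "Peter", "Green"),
    ("TimeStamp02", "Bob", "Blue"),
    ("TimeStamp03", "Peter", "Green"),
    ("TimeStamp04", "Peter", "Red"),
    ("TimeStamp05", "Peter", "Green"),
]:
    append_event(event_store, ts, name, color)
    upsert_event(blob_store, ts, name, color)

print(len(event_store["Peter"]))       # model 1 stores all four Peter events
print(json.loads(blob_store["Peter"]))  # model 2 stores three intervals
```

Model 1 keeps writes trivial but stores every event, while model 2 keeps the data small at the cost of a read-modify-write per event. In a real store that read-modify-write would need to be made safe against concurrent writers for the same key.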

Code samples for reading from and writing to HBase and Aerospike