Python Kafka: how to consume data based on timestamp


python apache-kafka


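I would like to know whether there is any way, other than working with offsets directly, to fetch data for a given time interval. For example, if I want to consume all of yesterday's data, how would I do that?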

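You can look up the earliest offset at the start of the time interval and rewind the consumer to that offset. It is harder to tell where the interval ends, because a record with an earlier timestamp may arrive later than records with newer ones. So you can consume records from the start of the interval until you find a record whose timestamp is later than the end time, and then read some more records to catch late-arriving messages.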

The code to rewind to startTime is:

// Assumes `consumer` is a KafkaConsumer with partitions already assigned;
// DateTime is presumably org.joda.time.DateTime (it provides getMillis()).
public void rewind(DateTime time) {
    // Ask for the earliest offset at or after `time` on every assigned partition.
    Map<TopicPartition, Long> query = new HashMap<>();
    for (TopicPartition topicPartition : consumer.assignment()) {
        query.put(topicPartition, time.getMillis());
    }
    Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

    // A null value means no record at or after `time` exists; this falls back to offset 0.
    result.forEach((partition, found) ->
            consumer.seek(partition, found == null ? 0L : found.offset()));
}
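The stopping rule described above (read from the start of the interval until a record later than the end time appears, then keep reading a little longer to catch late messages) could be sketched in Python roughly as follows. The function name `take_interval` and the `grace` parameter are hypothetical and chosen only for illustration; the sketch assumes each record exposes a `.timestamp` attribute in milliseconds, as kafka-python's consumer records do, and that the consumer has already been moved to the start offset.

```python
def take_interval(records, start_ms, end_ms, grace=100):
    """Collect records with start_ms <= timestamp < end_ms from a record stream.

    After the first record whose timestamp is at or past end_ms, up to
    `grace` further records are read so that late arrivals are still seen.
    `grace` is an illustrative knob, not a Kafka setting.
    """
    selected = []
    extra_left = None  # stays None until the first record at/after end_ms
    for rec in records:
        if start_ms <= rec.timestamp < end_ms:
            selected.append(rec)
        if extra_left is not None:
            # We are in the grace window: count down and stop when it is used up.
            extra_left -= 1
            if extra_left <= 0:
                break
        elif rec.timestamp >= end_ms:
            if grace <= 0:
                break
            extra_left = grace
    return selected
```

How large `grace` should be depends on how far out of order the producers can write; a time-based cutoff would work just as well as a record count.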


Use the consumer's offsets_for_times() method to get the correct offset for the required timestamp. In Python it looks like this:

from datetime import datetime
from kafka import KafkaConsumer, TopicPartition

topic  = "www.kilskil.com" 
broker = "localhost:9092"

# let's check the messages from the first day of the new year
# (naive datetimes are interpreted in local time by .timestamp())
date_in  = datetime(2019, 1, 1)
date_out = datetime(2019, 1, 2)

consumer = KafkaConsumer(topic, bootstrap_servers=broker, enable_auto_commit=True)
consumer.poll()  # we need to read a message or call a dummy poll() before seeking to the right position

tp = TopicPartition(topic, 0)  # partition no. 0
# in the simple case, without any special Kafka configuration, each topic has only one partition
# and its number is 0

# the two methods that do the work are offsets_for_times() and seek()
rec_in  = consumer.offsets_for_times({tp: int(date_in.timestamp() * 1000)})
rec_out = consumer.offsets_for_times({tp: int(date_out.timestamp() * 1000)})

consumer.seek(tp, rec_in[tp].offset)  # let's go to the first message of the new year!

c = 0
for msg in consumer:
  if msg.offset >= rec_out[tp].offset:
    break

  c += 1
  # message also has .timestamp field

print("{c} messages between {_in} and {_out}".format(c=c, _in=str(date_in), _out=str(date_out)))
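The conversion from a datetime to a Kafka timestamp in the code above can be isolated into a small helper. This is only a sketch: the name `to_kafka_millis` is hypothetical, and treating naive datetimes as UTC is an assumption made here (calling `.timestamp()` on a naive datetime, as above, uses the machine's local time zone instead).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_kafka_millis(dt):
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is meant as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids float rounding and returns an int.
    return (dt - EPOCH) // timedelta(milliseconds=1)
```

With such a helper the lookup would read `consumer.offsets_for_times({tp: to_kafka_millis(date_in)})`.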

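Don't forget that Kafka measures timestamps in milliseconds and stores them as a long. Python's datetime library returns timestamps in seconds, so we need to multiply them by 1000. The offsets_for_times method returns a dict with TopicPartition keys and OffsetAndTimestamp values.

A two-step process: find the offset range that corresponds to your date range, and then consume those records (by offset).

@Thilo Thanks for your comment. I did see that old thread and was wondering whether anything has changed. That means I store the offset details somewhere outside Kafka and query Kafka based on that, right?

Not necessarily. Recent Kafka versions have timestamps on all messages, so no storage outside Kafka is needed.

@Thilo Could you help me find some examples of accessing messages by timestamp? I am implementing this in Python, but a Java example would work too. I haven't found my way yet.

I am not familiar with the Python API, but if you have access to the record metadata (offset, partition number) when consuming messages, the timestamp should be there as well.

Is there a way to handle the case where date_out is later than the timestamp of the latest offset? It looks like the code will fail in that case; have you tested it?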
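One possible way to handle the case raised in the last comment is sketched below. It relies on kafka-python's `offsets_for_times()` returning `None` for a partition when no record has a timestamp at or after the requested one, and falls back to the log-end offset from `end_offsets()`. The helper names `offset_range` and `consume_range` are hypothetical, and the sketch assumes the partition is already assigned to the consumer and that offsets in the range are contiguous (no compaction gaps).

```python
def offset_range(consumer, tp, start_ms, end_ms):
    """Return (first, stop) offsets for [start_ms, end_ms); stop is exclusive."""
    log_end = consumer.end_offsets([tp])[tp]

    def lookup(ts_ms):
        found = consumer.offsets_for_times({tp: int(ts_ms)})[tp]
        # None means no record has a timestamp >= ts_ms: use the log end instead.
        return log_end if found is None else found.offset

    return lookup(start_ms), lookup(end_ms)


def consume_range(consumer, tp, start_ms, end_ms):
    """Yield the records whose offsets fall in the looked-up range."""
    first, stop = offset_range(consumer, tp, start_ms, end_ms)
    if first >= stop:
        return  # nothing in the interval, so never start a blocking read
    consumer.seek(tp, first)
    for msg in consumer:
        if msg.offset >= stop:
            break
        yield msg
        if msg.offset == stop - 1:
            break  # last record of the range: stop before blocking on new data
```

Stopping at `stop - 1` matters when the interval reaches the end of the log: otherwise the loop would wait for a message that may never arrive.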