Reading from RabbitMQ with Beam/Dataflow
I am trying to read from a RabbitMQ queue in streaming mode with Beam/Dataflow, so that the pipeline runs indefinitely. The minimal example code I am trying to run is:
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.rabbitmq.RabbitMqIO;
    import org.apache.beam.sdk.io.rabbitmq.RabbitMqMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class RabbitMqTest {
        public static void main(String[] args) {
            Pipeline pipeline = Pipeline.create();
            final String serverUri = "amqp://guest:guest@localhost:5672";
            pipeline
                .apply("Read RabbitMQ message", RabbitMqIO.read().withUri(serverUri).withQueue("my_queue"))
                .apply(ParDo.of(new DoFn<RabbitMqMessage, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Decode the raw message body, print it, and pass it downstream.
                        String message = new String(c.element().getBody());
                        System.out.println(message);
                        c.output(message);
                    }
                }));
            pipeline.run().waitUntilFinish();
        }
    }
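If I do not pass withMaxReadTime to RabbitMqIO, the pipeline does not work as intended.
If I do pass withMaxReadTime, it blocks for X seconds, processes all the messages that arrived during that time, and then exits.
How can I set up a streaming pipeline that reads from RabbitMQ indefinitely?

I had a similar problem with a Dataflow pipeline. When trying to run it in Dataflow, I got: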
java.lang.NullPointerException
org.apache.beam.runners.dataflow.worker.WindmillTimeUtils.harnessToWindmillTimestamp(WindmillTimeUtils.java:58)
org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:400)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1230)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
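The problem is that RabbitMqIO uses the timestamp of the RabbitMQ message, for example for the watermark. It turned out that in my case the messages coming from RabbitMQ had no timestamp set. RabbitMQ does not set a timestamp by default, so the value is null. I fixed this by preparing a patch for the classes in Apache Beam. I made a change in the RabbitMqMessage constructor, which now looks like this: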
    public RabbitMqMessage(String routingKey, QueueingConsumer.Delivery delivery) {
        this.routingKey = routingKey;
        body = delivery.getBody();
        contentType = delivery.getProperties().getContentType();
        contentEncoding = delivery.getProperties().getContentEncoding();
        headers = delivery.getProperties().getHeaders();
        deliveryMode = delivery.getProperties().getDeliveryMode();
        priority = delivery.getProperties().getPriority();
        correlationId = delivery.getProperties().getCorrelationId();
        replyTo = delivery.getProperties().getReplyTo();
        expiration = delivery.getProperties().getExpiration();
        messageId = delivery.getProperties().getMessageId();
        /*
         *** IMPORTANT ***
         Sometimes timestamp in RabbitMq message properties is 'null'. `RabbitMqIO` uses that value as
         watermark, when it is `null` it causes exceptions, 'null' has to be replaced with some value,
         in this case current time
         */
        // timestamp = delivery.getProperties().getTimestamp();
        timestamp = delivery.getProperties().getTimestamp() == null
            ? new Date()
            : delivery.getProperties().getTimestamp();
        type = delivery.getProperties().getType();
        userId = delivery.getProperties().getUserId();
        appId = delivery.getProperties().getAppId();
        clusterId = delivery.getProperties().getClusterId();
    }
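The null-timestamp fallback in the constructor above can be tried in isolation. The following is a minimal sketch that uses only the JDK, not the Beam or RabbitMQ classes; `TimestampFallback` and `timestampOrNow` are hypothetical names introduced here for illustration.

```java
import java.util.Date;

// Hypothetical helper, not part of Beam or the RabbitMQ client.
final class TimestampFallback {

    // Returns the broker-supplied timestamp, or the current time when the
    // publisher did not set one (the property is then null).
    static Date timestampOrNow(Date brokerTimestamp) {
        return brokerTimestamp != null ? brokerTimestamp : new Date();
    }

    public static void main(String[] args) {
        Date fromPublisher = new Date(1_000L);
        // A timestamp set by the publisher is kept unchanged.
        System.out.println(timestampOrNow(fromPublisher).getTime()); // prints 1000
        // A missing timestamp is replaced, so callers never see null.
        System.out.println(timestampOrNow(null) != null); // prints true
    }
}
```

Note that with this fallback, messages without a publisher timestamp carry the time they were read rather than the time they were published, which may matter for event-time windowing.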
I also had to change the advance method in RabbitMqIO so that it does not use the timestamp property, which may be null:
    @Override
    public boolean advance() throws IOException {
        try {
            QueueingConsumer.Delivery delivery = consumer.nextDelivery(1000);
            if (delivery == null) {
                return false;
            }
            if (source.spec.useCorrelationId()) {
                String correlationId = delivery.getProperties().getCorrelationId();
                if (correlationId == null) {
                    throw new IOException(
                        "RabbitMqIO.Read uses message correlation ID, but received "
                            + "message has a null correlation ID");
                }
                currentRecordId = correlationId.getBytes(StandardCharsets.UTF_8);
            }
            long deliveryTag = delivery.getEnvelope().getDeliveryTag();
            checkpointMark.sessionIds.add(deliveryTag);
            current = new RabbitMqMessage(source.spec.routingKey(), delivery);
            /*
             *** IMPORTANT ***
             Sometimes timestamp in RabbitMq messages is 'null'; the stream in Dataflow fails because
             watermark is based on that value, 'null' has to be replaced with some value. `RabbitMqMessage`
             was changed to use `new Date()` in this situation and now timestamp can be taken from it
             */
            // currentTimestamp = new Instant(delivery.getProperties().getTimestamp());
            currentTimestamp = new Instant(current.getTimestamp());
            if (currentTimestamp.isBefore(checkpointMark.oldestTimestamp)) {
                checkpointMark.oldestTimestamp = currentTimestamp;
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
        return true;
    }
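After running the pipeline again, I hit the same exception in another place. This time it was caused by the oldestTimestamp property in RabbitMQCheckpointMark having no default value. I made one more change, and RabbitMQCheckpointMark now looks like this: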
    private static class RabbitMQCheckpointMark
        implements UnboundedSource.CheckpointMark, Serializable {
        transient Channel channel;
        /*
         *** IMPORTANT *** it should be initialized with some value because without it runner
         (e.g Dataflow) fails with 'NullPointerException'
         Example error:
         java.lang.NullPointerException
         org.apache.beam.runners.dataflow.worker.WindmillTimeUtils.harnessToWindmillTimestamp(WindmillTimeUtils.java:58)
         org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:400)
         org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1230)
         org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
         org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         java.lang.Thread.run(Thread.java:745)
         */
        Instant oldestTimestamp = new Instant(Long.MIN_VALUE);
        final List<Long> sessionIds = new ArrayList<>();

        @Override
        public void finalizeCheckpoint() throws IOException {
            for (Long sessionId : sessionIds) {
                channel.basicAck(sessionId, false);
            }
            channel.txCommit();
            oldestTimestamp = Instant.now();
            sessionIds.clear();
        }
    }
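The bookkeeping in the checkpoint mark above (a timestamp that is never null, lowered as older messages arrive and reset when the checkpoint is finalized) can be sketched without a broker. This is only an illustration under stated assumptions: `CheckpointSketch`, `Mark`, `record`, and `watermark` are hypothetical names, `java.time.Instant` stands in for the Joda-Time `Instant` that Beam uses, and the initial value is passed in by the caller instead of being fixed as in the patch.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical, broker-free stand-in for the checkpoint mark shown above.
final class CheckpointSketch {

    static final class Mark {
        // Never null: set at construction, so a watermark query made before
        // the first message arrives still gets a usable value.
        private Instant oldestTimestamp;
        final List<Long> pendingDeliveryTags = new ArrayList<>();

        Mark(Instant initialTimestamp) {
            this.oldestTimestamp = Objects.requireNonNull(initialTimestamp);
        }

        // Remembers the delivery tag and lowers the oldest timestamp if needed.
        void record(long deliveryTag, Instant messageTimestamp) {
            pendingDeliveryTags.add(deliveryTag);
            if (messageTimestamp.isBefore(oldestTimestamp)) {
                oldestTimestamp = messageTimestamp;
            }
        }

        // The real class would ack every pending tag on the channel here.
        void finalizeCheckpoint(Instant now) {
            pendingDeliveryTags.clear();
            oldestTimestamp = now;
        }

        Instant watermark() {
            return oldestTimestamp;
        }
    }

    public static void main(String[] args) {
        Mark mark = new Mark(Instant.ofEpochMilli(5_000));
        mark.record(1L, Instant.ofEpochMilli(3_000));
        mark.record(2L, Instant.ofEpochMilli(4_000));
        System.out.println(mark.watermark().toEpochMilli()); // prints 3000
        mark.finalizeCheckpoint(Instant.ofEpochMilli(9_000));
        System.out.println(mark.watermark().toEpochMilli()); // prints 9000
    }
}
```

Requiring the initial timestamp in the constructor makes a null watermark impossible to construct, which is the property the patch relies on.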
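All of these changes fixed my pipeline, and it now works as expected. I hope you find this useful.

This was a bug in the IO, and it has since been removed.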