Cassandra: How should I maintain a connection to an external database in Cloud Dataflow?


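I have an unbounded Dataflow pipeline that reads from Pub/Sub, applies a ParDo, and writes to Cassandra. Because it applies only ParDo transformations, I am using the default global window with the default triggering, even though the source is unbounded.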

In a pipeline like this, how should I maintain the connection to Cassandra?

Currently, I create it in startBundle, as follows:

private class CassandraWriter <T> extends DoFn<T, Void> {
  private transient Cluster cluster;
  private transient Session session;
  private transient MappingManager mappingManager;

  @Override
  public void startBundle(Context c) {
    this.cluster = Cluster.builder()
        .addContactPoints(hosts)
        .withPort(port)
        .withoutMetrics()
        .withoutJMXReporting()
        .build();
    this.session = cluster.connect(keyspace);
    this.mappingManager = new MappingManager(session);
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }

  @Override
  public void finishBundle(Context c) throws IOException {
    session.close();
    cluster.close();
  }
}
However, this way a new connection is created for every element.

Another option is to pass it as a side input, like this:

public PDone apply(PCollection<T> input) {
  Pipeline p = input.getPipeline();

  CassandraWriteOperation<T> op = new CassandraWriteOperation<T>(this);

  Coder<CassandraWriteOperation<T>> coder =
      (Coder<CassandraWriteOperation<T>>) SerializableCoder.of(op.getClass());

  PCollection<CassandraWriteOperation<T>> opSingleton =
      p.apply(Create.of(op)).setCoder(coder);

  final PCollectionView<CassandraWriteOperation<T>> opSingletonView =
      opSingleton.apply(View.asSingleton());

  PCollection<Void> results = input.apply(ParDo.of(new DoFn<T, Void>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      // use the side input here
    }
  }).withSideInputs(opSingletonView));

  PCollectionView<Iterable<Void>> voidView = results.apply(View.asIterable());

  opSingleton.apply(ParDo.of(new DoFn<CassandraWriteOperation<T>, Void>() {
    private static final long serialVersionUID = 0;

    @Override
    public void processElement(ProcessContext c) {
      CassandraWriteOperation<T> op = c.element();
      op.finalize();
    }
  }).withSideInputs(voidView));

  return new PDone();
}
However, this way I have to use windowing, because the results.apply(View.asIterable()) call that produces voidView applies a group-by.


In general, how should a PTransform that writes from an unbounded PCollection to an external database maintain its connection to the database?

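You are correct that the typical bundle size in the streaming/unbounded case is smaller than in the batch/bounded case. The actual bundle size depends on many parameters, and sometimes a bundle may contain only a single element.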

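One way to solve this problem is to use a pool of connections per worker, stored in static state of your DoFn. You should be able to initialize it during the first call to startBundle and use it across bundles. Alternatively, you can create a connection on demand and release it to the pool for reuse when it is no longer needed.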


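You should make sure that the static state is thread-safe and that you are not making any assumptions about how Dataflow manages bundles.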
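The "create on demand, release to the pool for reuse" alternative can be sketched roughly as follows. This is a minimal illustration only: WorkerPool, acquire, release and createdCount are hypothetical names that do not come from the Dataflow SDK or the Cassandra driver, and the connection type is left generic so the sketch runs without either library.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical per-JVM pool; C stands in for a real connection type. */
final class WorkerPool<C> {
  // Thread-safe queue of idle connections, shared by all threads on the worker.
  private final BlockingQueue<C> idle = new LinkedBlockingQueue<>();
  private final Supplier<C> factory;
  private final AtomicInteger created = new AtomicInteger();

  WorkerPool(Supplier<C> factory) {
    this.factory = factory;
  }

  /** Reuses an idle connection if one exists, otherwise creates one on demand. */
  C acquire() {
    C connection = idle.poll();
    if (connection == null) {
      created.incrementAndGet();
      connection = factory.get();
    }
    return connection;
  }

  /** Returns a connection to the pool so that a later bundle can reuse it. */
  void release(C connection) {
    idle.offer(connection);
  }

  /** Number of connections created so far (for illustration only). */
  int createdCount() {
    return created.get();
  }
}
```

In a DoFn, such a pool would presumably live in a static final field so that it is shared across bundles on the same worker, with acquire called in startBundle and release in finishBundle. For the DataStax Java driver specifically, a Session is thread-safe and intended to be shared, so a single lazily created static Session may be all that is needed.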

As Davor Bonaci suggested, using static variables solved the problem.

public class CassandraWriter<T> extends DoFn<T, Void> {
  private static final Logger log = LoggerFactory.getLogger(CassandraWriter.class);

  // Prevent multiple threads from creating multiple cluster connection in parallel.
  private static transient final Object lock = new Object();
  private static transient Cluster cluster;
  private static transient Session session;
  private static transient MappingManager mappingManager;

  private final String[] hosts;
  private final int port;
  private final String keyspace;

  public CassandraWriter(String[] hosts, int port, String keyspace) {
    this.hosts = hosts;
    this.port = port;
    this.keyspace = keyspace;
  }

  @Override
  public void startBundle(Context c) {
    synchronized (lock) {
      if (cluster == null) {
        cluster = Cluster.builder()
            .addContactPoints(hosts)
            .withPort(port)
            .withoutMetrics()
            .withoutJMXReporting()
            .build();
        session = cluster.connect(keyspace);
        mappingManager = new MappingManager(session);
      }
    }
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }
}

Thanks. I was considering statics, but I was a little reluctant to use them because it is not one of the 4 documented approaches.