How to keep a connection to an external database in Cloud Dataflow
I have an unbounded Dataflow pipeline that reads from Pub/Sub, applies a ParDo, and writes to Cassandra. Because it applies only ParDo transforms, I use the default global window with default triggering, even though the source is unbounded.
In a pipeline like this, how should I keep the connection to Cassandra?
Currently I create it in startBundle, as follows:
private class CassandraWriter <T> extends DoFn<T, Void> {
  private transient Cluster cluster;
  private transient Session session;
  private transient MappingManager mappingManager;

  @Override
  public void startBundle(Context c) {
    this.cluster = Cluster.builder()
        .addContactPoints(hosts)
        .withPort(port)
        .withoutMetrics()
        .withoutJMXReporting()
        .build();
    this.session = cluster.connect(keyspace);
    this.mappingManager = new MappingManager(session);
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }

  @Override
  public void finishBundle(Context c) throws IOException {
    session.close();
    cluster.close();
  }
}
However, this way a new connection is created for each element.
Another option is to pass it as a side input, like this:
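public PDone apply(PCollection<T> input) {
  Pipeline p = input.getPipeline();
  CassandraWriteOperation<T> op = new CassandraWriteOperation<T>(this);
  Coder<CassandraWriteOperation<T>> coder =
      (Coder<CassandraWriteOperation<T>>) SerializableCoder.of(op.getClass());
  PCollection<CassandraWriteOperation<T>> opSingleton =
      p.apply(Create.<CassandraWriteOperation<T>>of(op)).setCoder(coder);
  final PCollectionView<CassandraWriteOperation<T>> opSingletonView =
      opSingleton.apply(View.<CassandraWriteOperation<T>>asSingleton());
  PCollection<Void> results = input.apply(ParDo.of(new DoFn<T, Void>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      // use the side input here
    }
  }).withSideInputs(opSingletonView));
  PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable());
  opSingleton.apply(ParDo.of(new DoFn<CassandraWriteOperation<T>, Void>() {
    private static final long serialVersionUID = 0;
    @Override
    public void processElement(ProcessContext c) {
      CassandraWriteOperation<T> op = c.element();
      op.finalize();
    }
  }).withSideInputs(voidView));
  return new PDone();
}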
However, this way I have to use windowing, because PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable()); applies a group-by.
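In general, how should a PTransform that writes from an unbounded PCollection to an external database keep its connection to the database?

You are correct that the typical bundle size in the streaming/unbounded case is smaller than in the batch/bounded case. The actual bundle size depends on many parameters, and sometimes a bundle may contain a single element.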
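One way to solve this problem is to use a pool of connections per worker, stored in static state of your DoFn. You should be able to initialize it during the first call to startBundle and use it across bundles. Alternatively, you can create a connection on demand and release it to the pool for reuse when it is no longer needed.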
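The on-demand variant described above can be sketched as follows. This is only an illustration: `SimplePool`, `borrow`, `release`, and `createdCount` are hypothetical names that belong to neither the Dataflow SDK nor the Cassandra driver, and a `StringBuilder` stands in for a real connection so the sketch runs without a database.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: a tiny thread-safe pool of reusable connections. */
final class SimplePool<C> {
    private final ConcurrentLinkedQueue<C> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<C> factory;
    private final AtomicInteger created = new AtomicInteger();

    SimplePool(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Returns an idle connection if one exists, otherwise creates a new one. */
    C borrow() {
        C conn = idle.poll();
        if (conn == null) {
            conn = factory.get();
            created.incrementAndGet();
        }
        return conn;
    }

    /** Hands a connection back so that a later bundle can reuse it. */
    void release(C conn) {
        idle.offer(conn);
    }

    /** Number of connections the factory has created so far. */
    int createdCount() {
        return created.get();
    }
}

final class PoolDemo {
    // Static, so every DoFn instance in the same worker JVM would share one pool.
    static final SimplePool<StringBuilder> POOL = new SimplePool<>(StringBuilder::new);

    public static void main(String[] args) {
        StringBuilder first = POOL.borrow();    // e.g. in startBundle of bundle 1
        POOL.release(first);                    // e.g. in finishBundle of bundle 1
        StringBuilder second = POOL.borrow();   // e.g. in startBundle of bundle 2
        System.out.println(first == second);    // prints "true": the connection was reused
        System.out.println(POOL.createdCount()); // prints "1"
    }
}
```

In a real DoFn the factory would build the Cassandra connection, with `borrow()` called in startBundle and `release()` in finishBundle, so small bundles reuse a connection instead of opening a new one each time.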
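You should make sure the static state is thread-safe and that you are not making any assumptions about how Dataflow manages bundles.

As Davor Bonaci suggested, using static variables solved the problem: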
// Imports assume the Dataflow SDK 1.x and the DataStax Java driver 3.x.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CassandraWriter<T> extends DoFn<T, Void> {
  private static final Logger log = LoggerFactory.getLogger(CassandraWriter.class);

  // Prevents multiple threads from creating multiple cluster connections in parallel.
  private static final Object lock = new Object();
  // Shared by every CassandraWriter instance in the same worker JVM.
  private static Cluster cluster;
  private static Session session;
  private static MappingManager mappingManager;

  private final String[] hosts;
  private final int port;
  private final String keyspace;

  public CassandraWriter(String[] hosts, int port, String keyspace) {
    this.hosts = hosts;
    this.port = port;
    this.keyspace = keyspace;
  }

  @Override
  public void startBundle(Context c) {
    // Connect only on the first bundle; later bundles reuse the same session.
    synchronized (lock) {
      if (cluster == null) {
        cluster = Cluster.builder()
            .addContactPoints(hosts)
            .withPort(port)
            .withoutMetrics()
            .withoutJMXReporting()
            .build();
        session = cluster.connect(keyspace);
        mappingManager = new MappingManager(session);
      }
    }
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }
}
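To illustrate why the synchronized null check matters, here is a minimal sketch of the same lazy-initialization pattern with the Cassandra and Dataflow types removed. `SharedResource` and `SharedResourceDemo` are hypothetical names, and a plain `Object` stands in for the cluster connection; the point is only that the factory runs once, however many threads ask for the resource.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical holder that mirrors the synchronized null check in startBundle. */
final class SharedResource<R> {
    private final Object lock = new Object();
    private final Supplier<R> factory;
    private R instance; // guarded by lock

    SharedResource(Supplier<R> factory) {
        this.factory = factory;
    }

    /** Creates the resource on the first call; later calls return the same instance. */
    R get() {
        synchronized (lock) {
            if (instance == null) {
                instance = factory.get();
            }
            return instance;
        }
    }
}

final class SharedResourceDemo {
    /** Starts the given number of threads and returns how often the factory ran. */
    static int countBuilds(int threads) {
        AtomicInteger builds = new AtomicInteger();
        SharedResource<Object> shared = new SharedResource<>(() -> {
            builds.incrementAndGet();
            return new Object();
        });
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                shared.get();
                done.countDown();
            }).start();
        }
        try {
            done.await(); // wait until every thread has called get()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return builds.get();
    }

    public static void main(String[] args) {
        System.out.println(countBuilds(8)); // prints "1": the factory ran once
    }
}
```

Without the lock, two threads could both see `instance == null` and each build their own connection, which is the situation the comment in the answer's code warns about.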
Thanks. I was considering static state, but I was a bit reluctant to use it because it is not one of the 4 documented methods.