PostgreSQL: Kafka Connect JDBC does not fetch consecutive timestamps from the source

postgresql jdbc apache-kafka


I am using kafka-connect-jdbc-4.0.0.jar and postgresql-9.4-1206-jdbc41.jar.

Configuration of the Kafka Connect connector:

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "topic.prefix": "streaming.data.v2",
  "connection.password": "password",
  "connection.user": "user",
  "schema.pattern": "test",
  "query": "select * from view_source",
  "connection.url": "jdbc:postgresql://host:5432/test?currentSchema=test"
}

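I have configured two connectors against a PostgreSQL 9.6.9 database using the JDBC driver: one source connector and one sink connector. Everything works correctly.

I have doubts about how the connector collects the source data. Looking at the logs, I see a 21-second time difference between query executions: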

11/1/2019 9:20:18[2019-01-11 08:20:18,985] DEBUG Checking for next block of results from TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)
11/1/2019 9:20:18[2019-01-11 08:20:18,985] DEBUG TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} prepared SQL query: select * from view_source WHERE "updated_at" > ? AND "updated_at" < ? ORDER BY "updated_at" ASC (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
11/1/2019 9:20:18[2019-01-11 08:20:18,985] DEBUG executing query select CURRENT_TIMESTAMP; to get current time from database (io.confluent.connect.jdbc.util.JdbcUtils)
11/1/2019 9:20:18[2019-01-11 08:20:18,985] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:17:07.000 end time = 2019-01-11 08:20:18.985 (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
11/1/2019 9:20:19[2019-01-11 08:20:19,070] DEBUG Resetting querier TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)

11/1/2019 9:20:49[2019-01-11 08:20:49,499] DEBUG Checking for next block of results from TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)
11/1/2019 9:20:49[2019-01-11 08:20:49,500] DEBUG TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} prepared SQL query: select * from view_source WHERE "updated_at" > ? AND "updated_at" < ? ORDER BY "updated_at" ASC (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
11/1/2019 9:20:49[2019-01-11 08:20:49,500] DEBUG executing query select CURRENT_TIMESTAMP; to get current time from database (io.confluent.connect.jdbc.util.JdbcUtils)
11/1/2019 9:20:49[2019-01-11 08:20:49,500] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:20:39.000 end time = 2019-01-11 08:20:49.500 (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)

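I assume that one of the two values is the timestamp of the last record retrieved and the other is the current timestamp at the moment of the query.

I cannot find an explanation for this. Is the connector working correctly?

Should I assume that not all of the data is always collected?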
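To make the two values in the log concrete, here is a minimal sketch of the windowed query as it appears in the DEBUG output (`WHERE "updated_at" > ? AND "updated_at" < ?`). It is not the connector's code: it uses an in-memory SQLite table as a stand-in for `view_source`, the rows and the `poll` helper are hypothetical, and timestamps are stored as fixed-format strings so that string comparison matches time order.

```python
import sqlite3

# Hypothetical stand-in for view_source; only updated_at matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE view_source (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO view_source VALUES (?, ?)",
    [
        (1, "2019-01-11 08:17:07.000"),  # equals the stored offset, so excluded
        (2, "2019-01-11 08:19:30.000"),  # inside the window
        (3, "2019-01-11 08:20:39.000"),  # after the end time, left for a later poll
    ],
)


def poll(conn, last_offset, db_now):
    """Run the query shape from the log; both bounds are exclusive."""
    sql = (
        "SELECT id, updated_at FROM view_source "
        'WHERE "updated_at" > ? AND "updated_at" < ? '
        'ORDER BY "updated_at" ASC'
    )
    return conn.execute(sql, (last_offset, db_now)).fetchall()


# Values taken from the first "Executing prepared statement" log line.
last_offset = "2019-01-11 08:17:07.000"
db_now = "2019-01-11 08:20:18.985"

rows = poll(conn, last_offset, db_now)
# The next lower bound is the largest timestamp actually returned, if any.
new_offset = rows[-1][1] if rows else last_offset

print(rows)
print(new_offset)
```

Under these assumptions, the lower bound of the next poll is the newest timestamp seen so far, not the end time of the previous poll, which is why the two values in consecutive log lines need not line up.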

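The JDBC connector is not guaranteed to retrieve every message. For that you need log-based change data capture (CDC), which for Postgres is provided by Debezium and Kafka Connect. You can read more about this here.

Disclaimer: I work for Confluent and wrote the blog mentioned above.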

Edit: a recording of a talk based on the above blog, given at ApacheCon 2020, is also available.

Thank you very much for your response, I really appreciate it. But why can't the connector guarantee the delivery of every row? Not even in incrementing mode, or timestamp+incrementing? Should I definitely use Debezium?

That is an implicit aspect of a polling-based approach: you cannot prevent multiple updates from happening between two polling attempts, and in that case the first update is not captured. Log-based CDC as done by Debezium avoids this, because it gets all changes from the database's append-only log file. Disclaimer: I contribute to Debezium.

@Gunnar. If I only use incrementing column mode for the JDBC connector and I poll every 1 hour, is there any reason to believe I will miss any new records?

1/2 I agree this is a good question. For example, if the connector runs at 13:30:01 but the most recent row in the table has a time of 13:29:40, I think you should be fine, since the subsequent run will query WHERE my_timestamp_col > '...13:29:40'. Right?

2/2 However, suppose the connector runs at 14:30:01, the most recent row in the table has timestamp 14:30:01, and after the connector runs a new row is written that also has timestamp 14:30:01. That new row is missed by the 14:30:01 run because it did not exist until a fraction of a second later, and it is then skipped on the subsequent run because that run filters on my_timestamp_col > '14:30:01'. Right?
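The two situations raised in the comments can be sketched with a toy poller. This is a simplified model, not the JDBC connector: the `Row` class and `poll` function are hypothetical, and only the strict lower-bound comparison from the comments is modelled. The query in the question's log also has an upper bound on the database's current time, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    value: str
    ts: str  # timestamp column as a fixed-format string, e.g. "14:30:01"


def poll(table, last_ts):
    """Return rows whose ts is strictly greater than last_ts, oldest first."""
    return sorted((r for r in table if r.ts > last_ts), key=lambda r: r.ts)


# Scenario A: a row is updated twice between two polls.
table_a = [Row(1, "a", "13:00:00")]
offset_a = "00:00:00"
seen = []

batch = poll(table_a, offset_a)
seen += [(r.id, r.value) for r in batch]
offset_a = batch[-1].ts

table_a[0] = Row(1, "b", "13:10:00")  # first update, overwritten before the next poll
table_a[0] = Row(1, "c", "13:20:00")  # second update

batch = poll(table_a, offset_a)
seen += [(r.id, r.value) for r in batch]
offset_a = batch[-1].ts

print(seen)  # the intermediate value "b" never appears

# Scenario B: a second row arrives with the same timestamp as the stored offset.
table_b = [Row(1, "x", "14:30:01")]
batch = poll(table_b, "13:29:40")
offset_b = batch[-1].ts  # offset is now "14:30:01"

table_b.append(Row(2, "y", "14:30:01"))  # written just after the poll
late = poll(table_b, offset_b)

print(late)  # empty: the strict > comparison skips the late row
```

In this model the poller only ever sees the state of a row at poll time, so intermediate states are lost, and a row that shares its timestamp with the stored offset is filtered out by the strict comparison. Whether the second case can occur in a real deployment depends on timestamp resolution and on how the query bounds are chosen.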
