Google BigQuery: How do I replace the data in a timestamp-partitioned table in BigQuery?

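The problem I'm trying to solve is removing duplicates from a specific partition referenced by a TIMESTAMP-type column. My table looks like the schema below, partitioned on the timestamp column with day-based granularity: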

requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING
I now have millions of these rows, and sometimes there are duplicates like this:

'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange
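A record's uniqueness is determined by two fields: requestID and recordNo. I want to remove the duplicates in the partition where CAST(ts AS DATE)='2020-06-10'. I can view the distinct records with a simple select: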

SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE)='2020-06-10'
There must be a way to combine DELETE/UPDATE/MERGE with SELECT DISTINCT so that I can replace the partition with the de-duplicated data.


Any ideas?

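The safest approach is to select only the data you need (with duplicates removed) into a new table, delete that data from the permanent table, and then insert the de-duplicated data back into the permanent location. BigQuery does not make UPDATE/DELETE operations as simple as some OLTP databases do.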
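The three steps just described could be sketched roughly as follows, scoped to the single day from the question. This is only an illustration under assumptions: `mytable` is the question's table (qualify it with your own dataset), `mytable_dedup_staging` is a hypothetical staging table name, and the range filter on `ts` assumes the day boundaries are in UTC, which is also the default for `CAST(ts AS DATE)`.

```sql
-- Step 1: stage one row per (requestID, recordNo) for the target day.
-- mytable_dedup_staging is a hypothetical name chosen for this sketch.
CREATE TABLE mytable_dedup_staging AS
SELECT * EXCEPT (instance_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM mytable
  WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11'
)
WHERE instance_num = 1;

-- Step 2: remove every row for that day from the permanent table.
DELETE FROM mytable
WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11';

-- Step 3: put the de-duplicated rows back.
INSERT INTO mytable (requestID, ts, recordNo, recordData)
SELECT requestID, ts, recordNo, recordData
FROM mytable_dedup_staging;

-- Step 4 (optional): drop the staging table after verifying the result.
DROP TABLE mytable_dedup_staging;
```

Between steps 2 and 3 the day's rows are absent from the permanent table, so it is worth checking the staging table's row count before running the DELETE.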

If you prefer a one-shot approach, here is an example that uses the data you provided:

-- SETUP
CREATE TABLE working.remove_dupes
(
  requestID STRING,
  ts TIMESTAMP,
  recordNo INT64,
  recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);

INSERT INTO working.remove_dupes(requestID, ts, recordNo, recordData)
VALUES
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 2, 'orange'),
('server1234', '2020-06-10', 2, 'orange');

------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1;


------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1);
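Edit: Note that, in my experience, replacing a table can drop the table's metadata (its description) as well as its partitioning. I have updated the example to include the table partitioning setup.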
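If replacing the whole table is undesirable, one alternative is a MERGE that rewrites only the affected day in a single statement. The sketch below is not part of the example above. It assumes the same `working.remove_dupes` table and UTC day boundaries, and it relies on `ON FALSE` so that no source row ever matches a target row: existing rows for that day are deleted and the de-duplicated rows are inserted.

```sql
-- Replace one day's rows with their de-duplicated version in a single statement.
MERGE working.remove_dupes AS t
USING (
  -- One row per (requestID, recordNo) for the target day.
  SELECT * EXCEPT (instance_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
    FROM working.remove_dupes
    WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11'
  )
  WHERE instance_num = 1
) AS s
ON FALSE
-- Every existing row for that day is unmatched by the source, so it is deleted.
WHEN NOT MATCHED BY SOURCE
  AND t.ts >= TIMESTAMP '2020-06-10' AND t.ts < TIMESTAMP '2020-06-11' THEN
  DELETE
-- Every de-duplicated row is unmatched by the target, so it is inserted.
WHEN NOT MATCHED BY TARGET THEN
  INSERT ROW;
```

Because the table itself is never replaced, its description and partitioning are left as they are. Rows outside the filtered day are untouched, since the DELETE branch carries the same date condition as the source query.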