Google BigQuery: How do I replace data in a timestamp-partitioned table in BigQuery?
The problem I am trying to solve is removing duplicates from a specific partition referenced by a TIMESTAMP type column. My table resembles the schema below, and the timestamp column partitioning has day-based granularity:
requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING
Now I have millions of these rows, and sometimes there are duplicates like this:
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange
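Record uniqueness is determined by two fields: requestID and recordNo. I want to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'. I can see the distinct records with a simple SELECT: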
SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'
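There must be a way to combine DELETE/UPDATE/MERGE with SELECT DISTINCT so that I can replace the partition with the deduplicated data.
Any ideas?
The safest approach is to select only the data you need (deduplicated) into a new table, delete the data from the permanent table, and then insert the deduplicated data back into the permanent location. BigQuery does not make the update/delete approach as simple as some OLTP databases do.
If you prefer a one-shot approach, here is an example using the data you provided: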
-- SETUP
CREATE TABLE working.remove_dupes
(
  requestID STRING,
  ts TIMESTAMP,
  recordNo INT64,
  recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);

INSERT INTO working.remove_dupes (requestID, ts, recordNo, recordData)
VALUES
  ('server1234', '2020-06-10', 1, 'apple'),
  ('server1234', '2020-06-10', 1, 'apple'),
  ('server1234', '2020-06-10', 2, 'orange'),
  ('server1234', '2020-06-10', 2, 'orange');
------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    -- Number the rows within each (requestID, recordNo) key; row 1 is the one kept
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1;
------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1);
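For comparison, the safer staged approach described before the example (stage the deduplicated rows, delete, insert back) could be sketched as below, limited to the 2020-06-10 data. This is only a sketch under some assumptions: working.remove_dupes_staging is a hypothetical table name, the day boundaries are taken in UTC, and the table is assumed to receive no new rows for that day while the statements run.

```sql
-- 1. Stage the deduplicated rows for the target day.
--    working.remove_dupes_staging is a hypothetical name used only for this sketch.
CREATE OR REPLACE TABLE working.remove_dupes_staging AS
SELECT requestID, ts, recordNo, recordData
FROM (
  SELECT
    requestID, ts, recordNo, recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM working.remove_dupes
  -- A range filter directly on ts covers the whole day (UTC assumed)
  WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11'
)
WHERE instance_num = 1;

-- 2. Delete that day's rows from the permanent table.
DELETE FROM working.remove_dupes
WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11';

-- 3. Insert the deduplicated rows back.
INSERT INTO working.remove_dupes (requestID, ts, recordNo, recordData)
SELECT requestID, ts, recordNo, recordData
FROM working.remove_dupes_staging;

-- 4. Drop the staging table once the result has been verified.
DROP TABLE working.remove_dupes_staging;
```

Steps 2 and 3 run as separate statements, so the day's data is briefly missing between them. It is worth checking the row count of the staging table before running the DELETE.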
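Edit: note that, in my experience, replacing a table can drop the table metadata (such as the description) as well as the table partitioning. I have updated the CREATE OR REPLACE TABLE example to include the table partitioning clause.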
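If replacing the whole table is a concern, one alternative that leaves the table definition untouched is a single MERGE that deletes one day's rows and inserts the deduplicated copies in the same statement. This is a sketch rather than a tested recipe: it reuses the working.remove_dupes table from the example above, assumes UTC day boundaries, and relies on the ON FALSE pattern, where no source row ever matches a target row.

```sql
MERGE working.remove_dupes AS t
USING (
  -- Deduplicated rows for the target day: one row per (requestID, recordNo)
  SELECT requestID, ts, recordNo, recordData
  FROM (
    SELECT
      requestID, ts, recordNo, recordData,
      ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
    FROM working.remove_dupes
    WHERE ts >= TIMESTAMP '2020-06-10' AND ts < TIMESTAMP '2020-06-11'
  )
  WHERE instance_num = 1
) AS s
ON FALSE  -- nothing matches, so every target row is "not matched by source"
WHEN NOT MATCHED BY SOURCE
  AND t.ts >= TIMESTAMP '2020-06-10' AND t.ts < TIMESTAMP '2020-06-11' THEN
  DELETE  -- remove only the existing rows of the target day
WHEN NOT MATCHED BY TARGET THEN
  INSERT (requestID, ts, recordNo, recordData)
  VALUES (s.requestID, s.ts, s.recordNo, s.recordData);
```

Rows outside 2020-06-10 are not touched, because the DELETE branch is restricted to that day and the source query only reads that day. As with the full-table replacement, making a copy of the table first is a sensible precaution.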