Google BigQuery: DELETE statement to remove duplicates

There are plenty of excellent posts about SQL that select the unique rows and write-truncate a table to remove duplicates, e.g.

WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
 )
SELECT
  * EXCEPT(rowNum)
FROM
  ev
WHERE rowNum = 1
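I am trying to explore this slightly differently using DML and DELETE, for example if you don't want to use a BQ saved query and simply want to execute SQL. What I want to do is roughly: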

WITH dup_events AS (
  SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
      FROM `duplicates`
 )
DELETE FROM
  dup_events
WHERE rowNum > 1
But I get the following error in the console:

Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]
Can this be achieved in standard SQL using DELETE?

Thanks

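Per the DELETE syntax, the target of a DELETE must be a table, and there is no provision for using a WITH clause. That makes sense, because you cannot delete from a CTE, which is essentially a logical view. You can express what you want by putting the logic inside the filter instead, e.g.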

DELETE
FROM duplicates AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC)
       FROM `duplicates` AS d2
       WHERE d.id = d2.id AND d.loadTime = d2.loadTime) > 1;
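
A commenter further down reports that BigQuery rejects analytic functions in a WHERE clause, so as a hedged alternative the same intent (keep only the newest row per id) can be sketched with a correlated EXISTS and no ROW_NUMBER at all. This is not the answer's original query and has not been run against the asker's table. The fully qualified table name is a placeholder, and rows that tie on the latest loadTime for an id are all kept.

```sql
-- Sketch: delete every row for which a strictly newer row with the same id exists.
-- `yourproject.yourdataset.duplicates` is a placeholder table name.
DELETE FROM `yourproject.yourdataset.duplicates` d
WHERE EXISTS (
  SELECT 1
  FROM `yourproject.yourdataset.duplicates` d2
  WHERE d2.id = d.id
    AND d2.loadTime > d.loadTime  -- strictly newer, so the latest row per id survives
);
```

Because the comparison is strict, the row with the greatest loadTime for each id never finds a newer match and is left in place.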

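The following works:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
  SELECT AS STRUCT id, MAX(loadTime) loadTime
  FROM `yourproject.yourdataset.duplicates`
  GROUP BY id
)
Note: this assumes loadTime is also unique. If several records for a given id share the latest loadTime, all of them will be kept.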
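
Since that DELETE relies on loadTime being unique per id, it may be worth checking the assumption first. A minimal sketch, using the same placeholder table name, that lists every (id, loadTime) pair occurring more than once:

```sql
-- Sketch: find (id, loadTime) pairs that appear more than once.
-- Any pair returned here whose loadTime is the latest for its id
-- would survive the DELETE above as a remaining duplicate.
SELECT
  id,
  loadTime,
  COUNT(*) AS copies
FROM `yourproject.yourdataset.duplicates`
GROUP BY id, loadTime
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

If this returns no rows, the uniqueness assumption holds for the current contents of the table.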

This has to be the simplest way:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)
If you have array data types, try the following:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3]  as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);
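
The same keep-one-row-per-ID rewrite can also be sketched with BigQuery's QUALIFY clause, which filters on a window function without needing ARRAY_AGG. This is an alternative formulation, not part of the original answer. It reuses the DW.pmoTest test table from above, and the WHERE TRUE is only there as a harmless filter to accompany QUALIFY.

```sql
-- Sketch: rebuild the table keeping one row per ID, earliest createdate first.
CREATE OR REPLACE TABLE DW.pmoTest AS
SELECT *
FROM DW.pmoTest
WHERE TRUE  -- no-op filter placed alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY createdate) = 1;
```

As with the ARRAY_AGG version, when two rows for an ID share the same createdate (as both ID = 1 rows do in the test data), which of them is kept is not deterministic.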

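The answers above are only suitable for small tables. If you have a large partitioned table and only want to remove duplicates within a given range, use the SQL below: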

-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table 
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partitioned table, using surrogate_key as unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW
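
For readers unfamiliar with the pattern: because the join condition is ON FALSE, no source row ever matches a destination row. Every destination row in the range is therefore "not matched by source" and is deleted, and every de-duplicated source row is "not matched" and is inserted. A minimal sketch of a follow-up check, reusing the table, column, and range values from the script above, and assuming no other process loads into the range in the meantime:

```sql
-- Sketch: after the MERGE, look for surrogate keys that still occur
-- more than once inside the de-duplicated range.
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

SELECT
  surrogate_key,
  COUNT(*) AS copies
FROM `gcp_project`.`data_set`.`the_table`
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
HAVING COUNT(*) > 1;
```

Any rows returned would indicate keys that are still duplicated within the range.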

Credit:

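Somehow I thought analytic functions were not allowed in the WHERE clause. I'm on the go right now and can't check it myself. Can you confirm that it really works?

In terms of trade-offs, what are your thoughts on using this method versus select-and-overwrite? DML quota is one, of course, and soon this will be supported on partitioned tables (I'm in the beta). One thing that seems advantageous is the lower risk of data loss. I'm not so sure about the other approach: if another process is loading into the table being de-duplicated while the de-duplication query runs, which may take a few minutes, is there a risk of losing newly loaded data if you mistime the query and the load? Any thoughts on this? Cheers

Actually, I just tried this and got this error: Error: Analytic function not allowed in WHERE clause at [5:4]

I think it can still work, but I need to try running an actual query :) Let me see...

Sorry, I don't have a good environment to test it. No laptop, and it's hard to see from my phone :) Can you see whether this edit works? It isn't as efficient as the analytic function approach, but it should at least work...

I got this error when putting the part in parentheses: Column instance of type ARRAY cannot be used in SELECT DISTINCT

This is great... thanks.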