Google BigQuery: removing duplicate records sometimes takes a long time

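We have implemented the following ETL process in the cloud: run a query against our local database every hour => save the result as CSV and upload it to Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records with the following query: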

SELECT 
  * EXCEPT (row_number)
FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number 
  FROM rawData.stock_movement
)
WHERE row_number = 1
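Since 8 a.m. today (Berlin local time), removing duplicates has taken much longer than usual, even though the data volume is not much different from normal: it usually takes about 10 seconds, but this morning it sometimes took half an hour.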


Is the performance of removing duplicate records unstable?

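It may be that a particular id has many duplicate values, so computing the row numbers takes a long time. To check whether this is the case, you can try: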

#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
That said, removing duplicates with this query may be faster:

#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);
Here is an example:

#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);

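The reason this may be faster is that, for each id, BigQuery only has to keep the row with the largest timestamp in memory at any given point in time.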
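
To check that the two approaches return the same rows, here is a minimal sketch that runs both against the toy data from the example. The CTE names `by_row_number` and `by_array_agg` and the alias `rn` are made up for illustration, and `T` stands in for `rawData.stock_movement`.

```sql
#standardSQL
-- Toy data copied from the example; T stands in for rawData.stock_movement.
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03'),
-- Approach from the question: number the rows within each id, keep row 1.
by_row_number AS (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
    FROM T
  )
  WHERE rn = 1
),
-- Approach from the answer: aggregate each id down to its latest row.
by_array_agg AS (
  SELECT latest_row.*
  FROM (
    SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
    FROM T AS t
    GROUP BY t.id
  )
)
-- Rows returned by the first approach that the second did not return.
SELECT * FROM by_row_number
EXCEPT DISTINCT
SELECT * FROM by_array_agg;
```

On this data both approaches keep the 'baz' row for id 1 and the 'bar' row for id 2, so the final query returns no rows. If several rows share the same id and the same timestamp, either approach may pick any one of them, so the comparison can differ on real data with such ties.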
