SQL: How can I optimize a Redshift query that joins a table to itself?
Suppose I have a transaction table:
CREATE TABLE IF NOT EXISTS txn_raw (
the_transaction_id VARCHAR(60),
sport_label VARCHAR(300),
family_label VARCHAR(150),
item_label VARCHAR(150)
)
DISTKEY (the_transaction_id)
SORTKEY (the_transaction_id, sport_label, family_label, item_label)
;
COMMIT;
I want to optimize the following query, which computes the correlation between items:
SELECT
a.sport_label as sport_label_a,
a.family_label as family_label_a,
a.item_label as item_label_a,
b.sport_label as sport_label_b,
b.family_label as family_label_b,
b.item_label as item_label_b,
count(distinct a.the_transaction_id) as txn_ab
FROM txn_raw a
JOIN txn_raw b
on a.the_transaction_id=b.the_transaction_id
and a.sport_label != b.sport_label
and a.family_label != b.family_label
and a.item_label != b.item_label
group by 1,2,3,4,5,6
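I am considering creating a temporary table to store the data after joining txn_raw with itself,
and then querying that temporary table with the GROUP BY.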
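The temporary-table idea described above could look roughly like the following. This is only a sketch: the name txn_pairs is hypothetical, and the third label column is assumed to be item_label, as in the table definition. It stores the full join output, so it does not by itself reduce the number of rows the join produces.

```sql
-- Hypothetical temp table holding the self-join result, distributed on the join key.
CREATE TEMP TABLE txn_pairs
DISTKEY (the_transaction_id)
AS
SELECT a.the_transaction_id,
       a.sport_label  AS sport_label_a,
       a.family_label AS family_label_a,
       a.item_label   AS item_label_a,
       b.sport_label  AS sport_label_b,
       b.family_label AS family_label_b,
       b.item_label   AS item_label_b
FROM txn_raw a
JOIN txn_raw b
  ON a.the_transaction_id = b.the_transaction_id
 AND a.sport_label  <> b.sport_label
 AND a.family_label <> b.family_label
 AND a.item_label   <> b.item_label;

-- Second step: aggregate the stored pairs.
SELECT sport_label_a, family_label_a, item_label_a,
       sport_label_b, family_label_b, item_label_b,
       COUNT(DISTINCT the_transaction_id) AS txn_ab
FROM txn_pairs
GROUP BY 1, 2, 3, 4, 5, 6;
```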
Is there a better way to optimize this kind of query?

I would suggest extracting the distinct values before the join, rather than after:
WITH r as (
SELECT DISTINCT the_transaction_id, sport_label, family_label, item_label
FROM txn_raw
)
SELECT a.sport_label as sport_label_a,
a.family_label as family_label_a,
a.item_label as item_label_a,
b.sport_label as sport_label_b,
b.family_label as family_label_b,
b.item_label as item_label_b,
COUNT(*) as txn_ab
FROM r a JOIN
r b
ON a.the_transaction_id = b.the_transaction_id AND
a.sport_label <> b.sport_label AND
a.family_label <> b.family_label AND
a.item_label <> b.item_label
GROUP BY 1,2,3,4,5,6;
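To illustrate why deduplicating before the join matters, here is a small sketch on made-up data. The table txn_demo and its rows are purely hypothetical: one transaction in which each of two items appears twice.

```sql
-- Hypothetical demo table with duplicated rows inside one transaction.
CREATE TEMP TABLE txn_demo (
    the_transaction_id VARCHAR(60),
    sport_label        VARCHAR(300),
    family_label       VARCHAR(150),
    item_label         VARCHAR(150)
);

INSERT INTO txn_demo VALUES
    ('t1', 'running', 'shoes',   'shoe_x'),
    ('t1', 'running', 'shoes',   'shoe_x'),
    ('t1', 'cycling', 'helmets', 'helmet_y'),
    ('t1', 'cycling', 'helmets', 'helmet_y');

-- Join on the raw rows: every copy of one item pairs with every copy of the other.
SELECT COUNT(*) AS joined_rows
FROM txn_demo a
JOIN txn_demo b
  ON a.the_transaction_id = b.the_transaction_id
 AND a.sport_label  <> b.sport_label
 AND a.family_label <> b.family_label
 AND a.item_label   <> b.item_label;
-- joined_rows = 8 (2 x 2 pairs in each direction)

-- Deduplicate first, then join: only one row per distinct item pair remains.
WITH r AS (
    SELECT DISTINCT the_transaction_id, sport_label, family_label, item_label
    FROM txn_demo
)
SELECT COUNT(*) AS joined_rows
FROM r a
JOIN r b
  ON a.the_transaction_id = b.the_transaction_id
 AND a.sport_label  <> b.sport_label
 AND a.family_label <> b.family_label
 AND a.item_label   <> b.item_label;
-- joined_rows = 2
```

Because r holds each (transaction, item) combination once, COUNT(*) per group gives the same result as COUNT(DISTINCT the_transaction_id) in the original query, without the extra rows.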
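You already seem to have an index covering all 4 columns. Is that right? Please provide sample data and the desired results.
You should look at, and provide, the explain plan for the query and its actual execution time. Since you are asking an optimization question, I assume the query takes too long. The first question is why; once you know that, what to do next follows.
Thanks! May I know why it is better to apply DISTINCT first?
@LouisLaw. To prevent the multiplication of rows, which only slows down the subsequent processing.
If what you want is for any one of the columns to differ, but not necessarily all of them, that is: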
FROM r a JOIN
r b
ON a.the_transaction_id = b.the_transaction_id AND
NOT (a.sport_label = b.sport_label AND
a.family_label = b.family_label AND
a.item_label = b.item_label
)
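Put together with the deduplicating CTE, the fragment above might be used as follows. This is a sketch only: it assumes the third label column is item_label, as in the table definition, and that the label columns contain no NULLs (a NULL label makes the NOT (...) test evaluate to NULL whenever the other labels match, which silently drops that pair).

```sql
-- Sketch: count transactions per pair of items that differ in at least one label.
WITH r AS (
    SELECT DISTINCT the_transaction_id, sport_label, family_label, item_label
    FROM txn_raw
)
SELECT a.sport_label  AS sport_label_a,
       a.family_label AS family_label_a,
       a.item_label   AS item_label_a,
       b.sport_label  AS sport_label_b,
       b.family_label AS family_label_b,
       b.item_label   AS item_label_b,
       COUNT(*)       AS txn_ab
FROM r a
JOIN r b
  ON a.the_transaction_id = b.the_transaction_id
 AND NOT (a.sport_label  = b.sport_label AND
          a.family_label = b.family_label AND
          a.item_label   = b.item_label)
GROUP BY 1, 2, 3, 4, 5, 6;
```

Prefixing the statement with EXPLAIN shows the query plan the commenters asked for. In Redshift, a join step labelled DS_DIST_NONE indicates that no rows had to be redistributed, which is the outcome one would hope for when both sides are distributed on the_transaction_id.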