SQL: Avoiding multiple subqueries in BigQuery
We are building an application that stores requests in one table and responses in another (of course). Each request can have multiple responses, and the request ID is stored in both tables.
Initially, I thought we could use a LEFT JOIN from requests to responses to count the totals matching each criterion:
SELECT source, COUNT(*) as requests, COUNT(responses.request_id) as responses
FROM DATASET.requests
LEFT JOIN DATASET.responses ON requests.id = responses.request_id
WHERE source = "source1"
GROUP BY source
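70 requests match the WHERE criteria, and 30 responses match it. The expected output is: "source1, 70, 30".
I have since learned more about join behavior, because what I get instead is "source1, 259, 207". There are duplicate IDs on both sides.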
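The inflated counts can be reproduced on a tiny inline data set. This is only a sketch: the rows below are hypothetical, not taken from the real tables, and the CTEs named requests and responses stand in for DATASET.requests and DATASET.responses. Request 'a' appears twice and has two responses, so the join emits 2 x 2 rows for it.

```sql
-- Hypothetical inline data standing in for the real tables.
WITH requests AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS id, 'source1' AS source),
    STRUCT('a' AS id, 'source1' AS source),  -- duplicate request ID
    STRUCT('b' AS id, 'source1' AS source)   -- request with no response
  ])
),
responses AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS request_id),
    STRUCT('a' AS request_id)                -- two responses for 'a'
  ])
)
SELECT
  requests.source,
  COUNT(*) AS joined_rows,                   -- counts joined rows, not requests
  COUNT(responses.request_id) AS joined_responses
FROM requests
LEFT JOIN responses ON requests.id = responses.request_id
WHERE requests.source = 'source1'
GROUP BY requests.source;
```

With these rows the query returns source1, 5, 4, although the inline tables hold only 3 requests and 2 responses: every duplicate on one side is repeated once per match on the other side.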
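The only way I have been able to get the result I want is to build one huge query out of several complete subqueries, each matching against the set of IDs filtered by the given criteria. The filtered ID set is then used to actually pull our fields, statistics, and so on: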
SELECT * FROM
-- responses reference requests through request_id
(SELECT COUNT(*) as responses FROM DATASET.responses
WHERE request_id IN (SELECT id FROM DATASET.requests WHERE source = "source1"))
,
(SELECT source, COUNT(*) as requests
FROM DATASET.requests
WHERE id IN (SELECT id FROM DATASET.requests WHERE source = "source1")
GROUP BY source)
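This looks terrible. I tried using a CTE to collect the list of IDs we want and then filtering with WHERE id/request_id IN (CTE.id), but apparently that is not possible unless we join on the CTE, which again produces wrong, multiplied results.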
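For reference, BigQuery standard SQL does not accept IN (CTE.id), but a CTE can be read from inside an IN subquery, which avoids both the join and the repeated ID filter. The following is a minimal sketch under assumptions: the inline rows are hypothetical, and wanted_ids is an illustrative name that does not appear in the original queries.

```sql
-- Hypothetical inline data standing in for the real tables.
WITH requests AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS id, 'source1' AS source),
    STRUCT('a' AS id, 'source1' AS source),
    STRUCT('b' AS id, 'source1' AS source),
    STRUCT('c' AS id, 'source2' AS source)
  ])
),
responses AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS request_id),
    STRUCT('a' AS request_id),
    STRUCT('c' AS request_id)
  ])
),
-- The ID filter is written once and reused by every count below.
wanted_ids AS (
  SELECT DISTINCT id FROM requests WHERE source = 'source1'
)
SELECT
  (SELECT COUNT(*) FROM requests
   WHERE id IN (SELECT id FROM wanted_ids)) AS requests,
  (SELECT COUNT(*) FROM responses
   WHERE request_id IN (SELECT id FROM wanted_ids)) AS responses;
```

On this data the result is 3 requests and 2 responses. Each count is taken against its own table, so nothing is multiplied, and a further statistic only needs one more scalar subquery that reuses wanted_ids.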
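Since we want to add more statistics to the query, each of which will need further WHERE clauses, I am worried that this monster will keep growing and become hard to maintain.
If there is a better way, please let me know. Thanks.
Edit: sample schema added on request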
Requests
id (String), source (String), partner_ids (Integer array), user_agent (String), timestamp (Timestamp), ...
Responses
request_id (String, from requests.id), partner_id (Integer), is_billed (boolean), price_charged (float, null if is_billed = false), response_categories (String array, not from requests), ...
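The challenge is that we mainly have to query the Requests table to get the list of ID values matching our criteria, and then query statistics on each table (e.g. counts, where billed, etc.) for one combined report. We may also need to draw the ID pool from conditions on each table (e.g. where requests.source = 'source1' and "action" is in responses.response_categories).

I think you can do what you want using union all and group by: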
-- assumes both tables have a source column
select source, sum(requests) as requests, sum(responses) as responses
from ((select source, count(*) as requests, 0 as responses
       from dataset.requests
       group by source
      ) union all
      (select source, 0 as requests, count(*) as responses
       from dataset.responses
       group by source
      )
     ) rr
group by source;
This does the calculation for all sources.
Edit:
For the revised question, just use an additional join:
select source, sum(requests) as requests, sum(responses) as responses
from ((select rq.source, count(*) as requests, 0 as responses
       from dataset.requests rq
       group by rq.source
      ) union all
      (select rq.source, 0 as requests, count(*) as responses
       from dataset.responses r join
            -- one row per request id, carrying its source
            (select distinct rq.id, rq.source
             from dataset.requests rq
            ) rq
            on r.request_id = rq.id
       group by rq.source
      )
     ) rr
group by source;
If each request has at most one response, you can shorten this to:
select rq.source, count(*) as requests, count(r.request_id) as responses
from dataset.requests rq left join
     dataset.responses r
     on r.request_id = rq.id
group by rq.source;
Maybe I am misunderstanding something, but why not do the counts first and then join on the ID?
WITH sources AS (
  SELECT COUNT(*) AS source_cnt, id
  FROM dataset.requests
  GROUP BY id
),
responses AS (
  SELECT COUNT(*) AS response_cnt, request_id
  FROM dataset.responses
  GROUP BY request_id
)
SELECT source_cnt, response_cnt, sources.id
FROM sources INNER JOIN responses ON sources.id = responses.request_id;
If you want to keep all records, you can change it to a full outer join:
WITH sources AS (
  SELECT COUNT(*) AS source_cnt, id
  FROM dataset.requests
  GROUP BY id
),
responses AS (
  SELECT COUNT(*) AS response_cnt, request_id
  FROM dataset.responses
  GROUP BY request_id
)
SELECT COALESCE(sources.id, responses.request_id) AS id, source_cnt, response_cnt
FROM sources FULL OUTER JOIN responses ON sources.id = responses.request_id;
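To get per-source totals out of this count-then-join idea, the per-ID counts can be summed after the join. The sketch below is an illustration rather than a tested solution for the real tables: the inline rows are hypothetical, the names request_counts and response_counts are made up for this example, and it assumes every request ID belongs to exactly one source.

```sql
-- Hypothetical inline data standing in for the real tables.
WITH requests AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS id, 'source1' AS source),
    STRUCT('a' AS id, 'source1' AS source),
    STRUCT('b' AS id, 'source1' AS source),
    STRUCT('c' AS id, 'source2' AS source)
  ])
),
responses AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS request_id),
    STRUCT('a' AS request_id),
    STRUCT('c' AS request_id)
  ])
),
-- Collapse each side to one row per ID before joining.
request_counts AS (
  SELECT id, source, COUNT(*) AS request_cnt
  FROM requests
  GROUP BY id, source
),
response_counts AS (
  SELECT request_id, COUNT(*) AS response_cnt
  FROM responses
  GROUP BY request_id
)
SELECT
  rq.source,
  SUM(rq.request_cnt) AS requests,
  SUM(IFNULL(rs.response_cnt, 0)) AS responses  -- requests without responses add 0
FROM request_counts AS rq
LEFT JOIN response_counts AS rs ON rs.request_id = rq.id
GROUP BY rq.source;
```

On this data the result is source1, 3, 2 and source2, 1, 1. Because both sides are reduced to one row per ID first, the join is at most one-to-one and cannot multiply either count.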
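Honestly, I am a bit confused about what you ultimately want to see, and I do not fully understand why you have 70 requests and only 30 responses if one request can have multiple responses. Do you mean that some requests can have 0 responses? Or are you counting distinct responses?
If you want to count the total number of requests along with the total number of responses related to those specific requests, I believe this slight modification of your code should work: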
SELECT source, COUNT(DISTINCT id) as requests, COUNT(responses.request_id) as responses
FROM `dataset.requests` as requests
LEFT JOIN `dataset.responses` as responses ON requests.id = responses.request_id
WHERE source = "source1"
GROUP BY source
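Sample data and desired results would help.
Thank you, but I do not think this will work either. I have added a clarification: the source field exists only in the requests table, and the ID column is the only thing the two tables have in common. So it is still multiplying the response count, which should be 30.
@EvanTestvoid. You should clarify the question with sample data. You are not looking for the number of responses. You are looking for the number of requests with responses. That is a completely different calculation.
That is not correct. We need independent counts of the requests and of the responses whose ID values are in the pre-filtered ID set, then sorted by ID and its associated column values in either table. To reiterate, there are 70 entries in total in requests and 30 entries in responses. That is the size of each table.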