Google BigQuery: creating daily bins for blockchain transactions
After some manipulation, I ended up with a table in GBQ that lists every transaction on a blockchain (about 280 million rows). Since this table contains all transactions, summing all values for each user up to a given date should give that user's balance. With almost 22 million users, I would like to bin them by the number of coins they hold. I ran through the whole dataset with the following code:
#standardSQL
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      MAX(timestamp),
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin
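Now I would like to walk through the rows of the transaction table and, at the end of each day, check how many users fall into each bin. The final table should look like this: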
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| timestamp | 0_to_1 | 1_to_10 | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000 |
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| 2009-01-09 00:00:00 UTC | 1 | 1 | 0 | 0 | 0 | 0 |
| 2009-01-10 00:00:00 UTC | 0 | 2 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315 | 234523555 | 2352355556 | 12124235231|
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
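Because the dataset is so large, I cannot order it by timestamp to make my life easier, so I would appreciate some ideas. For example, I wonder whether pagination could improve performance and save resources. I have heard of it but do not know how to use it.
Thanks in advance.
UPDATE: after some work, I now have a transaction table ordered by timestamp.

The query below should give you the count of transactions inside each bin, grouped by timestamp. Keep in mind that this query evaluates the value of each transaction at row level, not the accumulated balance.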
SELECT
  -- Truncate to the day so that each output row covers one day
  TIMESTAMP_TRUNC(timestamp, DAY) AS day,
  COUNT(DISTINCT(CASE
      WHEN value > 0 AND value <= 1 THEN receiver
  END)) AS _0_to_1,
  COUNT(DISTINCT(CASE
      WHEN value > 1 AND value <= 10 THEN receiver
  END)) AS _1_to_10,
  COUNT(DISTINCT(CASE
      WHEN value > 10 AND value <= 100 THEN receiver
  END)) AS _10_to_100,
  COUNT(DISTINCT(CASE
      WHEN value > 100 AND value <= 1000 THEN receiver
  END)) AS _100_to_1000,
  COUNT(DISTINCT(CASE
      WHEN value > 1000 AND value <= 10000 THEN receiver
  END)) AS _1000_to_10000,
  COUNT(DISTINCT(CASE
      WHEN value > 10000 THEN receiver
  END)) AS More_10000
FROM `table.transactions`
-- Match every row from the previous (UTC) day, not one exact instant
WHERE DATE(timestamp) = DATE(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
GROUP BY 1
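Since the query above bins individual transaction values rather than balances, here is a rough sketch of the running-balance logic the question describes. It is plain Python rather than BigQuery SQL, and it is meant only to illustrate the idea: keep one balance per receiver, and whenever a day's transactions move a receiver from one bin to another, decrement the old bin and increment the new one. The function names (`bin_of`, `daily_bin_counts`) and the `(day, receiver, value)` tuple layout are assumptions made for this example. It also assumes the data fits in memory, which will not hold for 280 million rows without further work.

```python
from collections import Counter, defaultdict
from datetime import date

# Bin labels and inclusive upper bounds, mirroring the CASE expression in the question.
BINS = ["0_to_1", "1_to_10", "10_to_100", "100_to_1000", "1000_to_10000", "More_10000"]
UPPER_BOUNDS = [1, 10, 100, 1000, 10000]


def bin_of(balance):
    """Return the bin label for a balance, or None when the balance is <= 0."""
    if balance <= 0:
        return None
    for label, upper in zip(BINS, UPPER_BOUNDS):
        if balance <= upper:
            return label
    return BINS[-1]


def daily_bin_counts(transactions):
    """Compute end-of-day user counts per bin from running balances.

    transactions: iterable of (day, receiver, value) tuples in any order
    (hypothetical layout). Returns {day: {bin_label: user_count}} for every
    day that has at least one transaction.
    """
    # Net change per receiver per day.
    deltas = defaultdict(lambda: defaultdict(float))
    for day, receiver, value in transactions:
        deltas[day][receiver] += value

    balances = defaultdict(float)  # running balance per receiver
    counts = Counter()             # current number of users in each bin
    snapshots = {}
    for day in sorted(deltas):
        for receiver, delta in deltas[day].items():
            old_bin = bin_of(balances[receiver])
            balances[receiver] += delta
            new_bin = bin_of(balances[receiver])
            if old_bin != new_bin:
                # Only receivers that cross a bin boundary change the counts.
                if old_bin is not None:
                    counts[old_bin] -= 1
                if new_bin is not None:
                    counts[new_bin] += 1
        snapshots[day] = {label: counts[label] for label in BINS}
    return snapshots


# Made-up sample data, for illustration only.
sample = [
    (date(2009, 1, 9), "alice", 0.5),
    (date(2009, 1, 9), "bob", 5.0),
    (date(2009, 1, 10), "alice", 2.0),
]
result = daily_bin_counts(sample)
```

With this sample, 2009-01-09 ends with one user in `0_to_1` and one in `1_to_10`, and 2009-01-10 ends with two users in `1_to_10`, the same shape as the first two rows of the expected table. Only receivers that transact on a given day are touched, so the work per day is proportional to that day's activity rather than to the total number of users.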
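On the performance side, one area you may want to explore (if possible) is creating a partitioned version of this large table. That would help you 1) improve performance and 2) reduce the cost of querying a specific date range. You can find more information in the BigQuery documentation on partitioned tables.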
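EDIT
I added a WHERE clause to the query to filter for the previous day. I am assuming, for example, that you will run the query today to get the previous day's data. You may need to adjust CURRENT_TIMESTAMP() to your timezone by wrapping it in an additional TIMESTAMP_SUB(..., INTERVAL X HOUR) or TIMESTAMP_ADD(..., INTERVAL X HOUR), where X is the number of hours to subtract or add to match the timezone of the data you are analyzing.
Also, depending on the type of your field, you may need CAST(timestamp AS timestamp).

Comments:

This counts the values that match the bins, right? I need the value to be the sum of all previous values.

It counts the number of values that match the bins, right. What exactly do you mean by "the sum of all previous values"?

I just re-read your question and saw that you need the number of users in each bin. I have updated the answer.

In the end, after several hours of working on the script, I found that getting the exact balance of every account on a given date is not feasible (at least not within a reasonable time window). So I will use your answer as an approximation. Thanks!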