Google BigQuery: creating daily bins for blockchain transactions


After some manipulation, I ended up with a table in GBQ that lists every transaction on the blockchain (about 280 million rows).

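Since this table contains all transactions, summing all of a user's values up to a given date should give that user's balance. With nearly 22 million users, I would like to bin them by the number of coins they hold. I used the following code to go through the whole dataset: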

#standardSQL
-- Count receivers per balance bin as of a cutoff date.
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin  -- NULL when val <= 0
  FROM (
    -- Balance per receiver: sum of all values before the cutoff.
    SELECT
      MAX(timestamp) AS last_tx,
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin
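Now I want to walk through the transactions table at the end of each day and check the number of users in each bin. The final table should look like this: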

+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
|           timestamp     | 0_to_1  |  1_to_10  | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000 |
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| 2009-01-09 00:00:00 UTC | 1       | 1         | 0         | 0           | 0             | 0          |
| 2009-01-10 00:00:00 UTC | 0       | 2         | 0         | 0           | 0             | 0          |
| ...                     | ...     | ...       | ...       | ...         | ...           | ...        |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315  | 234523555   | 2352355556    | 12124235231|
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
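Because the dataset is so large, I can't order it by timestamp to make my life easier, so I would appreciate any ideas. For example, I wonder whether pagination could improve performance and save resources. I have heard of it but don't know how to use it.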

Thanks in advance.



Update: after some work, I now have a transactions table ordered by timestamp.

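The query below should give you, per timestamp, the number of distinct receivers with a transaction in each bin. Keep in mind that this query evaluates the value of each transaction at the row level, not a running balance.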

-- Distinct receivers per bin, judged by the value of each individual transaction.
SELECT
  timestamp,
  COUNT(DISTINCT(CASE
    WHEN value > 0 AND value <= 1 THEN receiver
  END)) AS _0_to_1,
  COUNT(DISTINCT(CASE
    WHEN value > 1 AND value <= 10 THEN receiver
  END)) AS _1_to_10,
  COUNT(DISTINCT(CASE
    WHEN value > 10 AND value <= 100 THEN receiver
  END)) AS _10_to_100,
  COUNT(DISTINCT(CASE
    WHEN value > 100 AND value <= 1000 THEN receiver
  END)) AS _100_to_1000,
  COUNT(DISTINCT(CASE
    WHEN value > 1000 AND value <= 10000 THEN receiver
  END)) AS _1000_to_10000,
  COUNT(DISTINCT(CASE
    WHEN value > 10000 THEN receiver
  END)) AS More_10000
FROM `table.transactions`
-- Note: this equality matches only rows whose timestamp equals that exact instant.
WHERE timestamp = TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY 1
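Regarding performance, one area you may want to explore (if possible) is creating a partitioned version of this large table. This would help you 1) improve performance and 2) reduce the cost of queries that touch only a specific date range. More information is available in the BigQuery documentation on partitioned tables.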
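As a rough illustration of the partitioning suggestion, the sketch below copies the table into a day-partitioned one and then queries it with a filter on the partitioning column. It assumes the timestamp column has type TIMESTAMP; the destination name `table.transactions_partitioned` is hypothetical, and partition-count quotas may call for coarser (for example monthly) partitions.

```sql
-- Hypothetical destination table; `table.transactions` is the source from the question.
CREATE TABLE `table.transactions_partitioned`
PARTITION BY DATE(timestamp)
AS
SELECT *
FROM `table.transactions`;

-- Filtering on the partitioning column lets BigQuery skip partitions outside the range.
SELECT
  receiver,
  SUM(value) AS val
FROM `table.transactions_partitioned`
WHERE timestamp < TIMESTAMP '2011-02-12 00:00:00'
GROUP BY receiver;
```

Queries that always read the full history gain little from this; the benefit shows up when the filter narrows the date range.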

Edit

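I added a WHERE clause to the query to filter for the previous day; I am assuming, for example, that you will run the query today to get yesterday's data. You may also need to adjust CURRENT_TIMESTAMP() to your time zone by wrapping it in an extra TIMESTAMP_SUB(..., INTERVAL X HOUR) or TIMESTAMP_ADD(..., INTERVAL X HOUR), where X is the number of hours to subtract or add to match the time zone of the data you are analyzing.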


Also, depending on the type of the field, you may need CAST(timestamp AS TIMESTAMP).
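Because an equality against CURRENT_TIMESTAMP() minus one day matches a single instant, a day-level comparison may be closer to the intent of "the previous day". The sketch below swaps the hour arithmetic for the time-zone argument of DATE and CURRENT_DATE; the 'UTC' zone and the output column names are assumptions to adapt to your data.

```sql
-- Sketch: all rows from "yesterday", where the day boundary is taken in an assumed time zone.
SELECT
  DATE(timestamp, 'UTC') AS tx_date,
  COUNT(DISTINCT receiver) AS receivers
FROM `table.transactions`
WHERE DATE(timestamp, 'UTC') = DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY)
GROUP BY tx_date
```

The per-bin COUNT(DISTINCT CASE ...) columns from the query above can be added to this SELECT list unchanged.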

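This will count the values that match the bins, right? I need the value to be the sum of all previous values.

It will count the number of values that match the bins, right. What exactly do you mean by "the sum of all previous values"?

I just re-read your question and realized you need the number of users in each bin. I have updated the answer.

In the end, after working on the script for a few hours, I found that getting each account's exact balance on a given date is not feasible (at least not within a reasonable time window). So I will use your answer as an approximation. Thanks!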
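For readers who still want the running-balance version discussed in these comments, one possible approach is sketched below; it has not been run against the full table. It assumes that value is signed (outgoing amounts negative), so that a running sum per receiver is a balance, and all CTE and column names are illustrative. Each receiver's balance is treated as valid from one active day until the next, and bin membership is tracked as +1/-1 events so that no per-day scan of all users is needed.

```sql
WITH daily AS (
  -- Net amount per receiver per day.
  SELECT receiver, DATE(timestamp) AS tx_date, SUM(value) AS delta
  FROM `table.transactions`
  GROUP BY receiver, tx_date
),
balances AS (
  -- Running balance after each active day, valid until the receiver's next active day.
  SELECT
    tx_date AS valid_from,
    LEAD(tx_date) OVER (PARTITION BY receiver ORDER BY tx_date) AS valid_to,
    SUM(delta) OVER (PARTITION BY receiver ORDER BY tx_date) AS balance
  FROM daily
),
binned AS (
  SELECT
    valid_from,
    valid_to,
    CASE
      WHEN balance > 0 AND balance <= 1 THEN '0_to_1'
      WHEN balance > 1 AND balance <= 10 THEN '1_to_10'
      WHEN balance > 10 AND balance <= 100 THEN '10_to_100'
      WHEN balance > 100 AND balance <= 1000 THEN '100_to_1000'
      WHEN balance > 1000 AND balance <= 10000 THEN '1000_to_10000'
      WHEN balance > 10000 THEN 'More_10000'
    END AS bin
  FROM balances
),
events AS (
  -- +1 on the day a receiver enters a bin, -1 on the day its balance next changes.
  SELECT valid_from AS tx_date, bin, 1 AS change
  FROM binned
  WHERE bin IS NOT NULL
  UNION ALL
  SELECT valid_to AS tx_date, bin, -1 AS change
  FROM binned
  WHERE bin IS NOT NULL AND valid_to IS NOT NULL
),
daily_change AS (
  SELECT tx_date, bin, SUM(change) AS net_change
  FROM events
  GROUP BY tx_date, bin
)
-- Cumulative sum of the daily net changes gives the number of users in each bin.
SELECT
  tx_date,
  bin,
  SUM(net_change) OVER (PARTITION BY bin ORDER BY tx_date) AS users
FROM daily_change
ORDER BY tx_date, bin
```

The result is in long form (one row per day and bin) and only contains days on which a bin's membership changed; pivoting into the wide table from the question and filling the unchanged days are left out of this sketch.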