Google BigQuery: query for the distribution of overlapping values

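I would like to know whether it is possible (and how) to get the minimum overlap of distinct values across a range of values.

For example, say I have 3 bags of XY values, and I want a single query (this is important; I know how to do it for each bag separately) that gives the percentage of exclusive values in each bag (relative to all other bags).

Here is an example: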

BAG | VALUE
1 | 100
1 | 102
1 | 100
2 | 100
2 | 101
2 | 101
3 | 103
3 | 103
3 | 102
3 | 104

So what I would get here is:

BAG | MINIMUM EXCLUSIVE VALUES
1 | 0  (no items here are exclusive)
2 | 0.5 (only item 101 is exclusive in this bag and since distinct count of all items in this bag is 2, 50% of the bag is exclusive)
3 | 0.666666 (items 103 and 104 are exclusive to this bag and since distinct count of all items in the bag is 3 this gives 66.67% of exclusive items)
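
The definition above can be checked outside BigQuery with a few lines of Python. This is only a reference sketch for the expected numbers, not a BigQuery query; the function name `exclusive_fraction` is made up for illustration, and the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows (bag, value) from the question.
ROWS = [
    (1, 100), (1, 102), (1, 100),
    (2, 100), (2, 101), (2, 101),
    (3, 103), (3, 103), (3, 102), (3, 104),
]


def exclusive_fraction(rows):
    """For each bag, return the share of its distinct values found in no other bag."""
    bags_per_value = defaultdict(set)   # value -> bags that contain it
    values_per_bag = defaultdict(set)   # bag -> distinct values it contains
    for bag, value in rows:
        bags_per_value[value].add(bag)
        values_per_bag[bag].add(value)
    return {
        bag: sum(len(bags_per_value[v]) == 1 for v in values) / len(values)
        for bag, values in values_per_bag.items()
    }


if __name__ == "__main__":
    for bag, fraction in sorted(exclusive_fraction(ROWS).items()):
        print(bag, round(fraction, 6))
```

Duplicate rows inside a bag do not matter here, because both the numerator and the denominator count distinct values only.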

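Is there any way to do this with a single BigQuery query (that is, without rewriting the query for every bag in the set, since there may be quite a lot of bags)? The query may of course contain subqueries, but it should not be bound (hardcoded) to each bag.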

Building on @N.N.'s solution:

-- Legacy SQL: comma-separated subqueries in FROM are combined like UNION ALL.
-- Returns one row per (BAG, VALUE) pair.
Select BAG, VALUE, IF(CNT_BagsPerValue>1, 0, CNT/CNT_ValuesPerBag) as MIN_EXCLUSIVE_VALUES
FROM
    (
    -- CNT is the number of rows for this (BAG, VALUE) pair, duplicates included
    Select BAG, VALUE, CNT_BagsPerValue, CNT_ValuesPerBag, Count(*) as CNT
    FROM
        (
        Select BAG,
                VALUE,
                Count(Distinct BAG) OVER(Partition BY VALUE) as CNT_BagsPerValue,  -- bags containing this value
                Count(Distinct VALUE) OVER(Partition BY BAG) as CNT_ValuesPerBag   -- distinct values in this bag
         FROM
          (Select 1 as BAG, 100 AS VALUE),
          (Select 1 as BAG, 102 AS VALUE),
          (Select 1 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 102 AS VALUE),
          (Select 3 as BAG, 104 AS VALUE)
         )
    GROUP BY BAG, VALUE, CNT_BagsPerValue, CNT_ValuesPerBag
    )

Aggregated to one row per bag:

SELECT BAG, SUM(is_unique)/MAX(CVB) as MINIMUM_EXCLUSIVE_VALUES
FROM
    (
    -- is_unique = 1 when the value occurs in exactly one bag
    SELECT BAG, VALUE, MAX(IF(CBV=1,1,0)) as is_unique, MAX(CVB) as CVB
    FROM
        (
        SELECT BAG,
                VALUE,
                Count(Distinct BAG) OVER(Partition BY VALUE) as CBV,  -- bags containing this value
                Count(Distinct VALUE) OVER(Partition BY BAG) as CVB   -- distinct values in this bag
         FROM
          (Select 1 as BAG, 100 AS VALUE),
          (Select 1 as BAG, 102 AS VALUE),
          (Select 1 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 102 AS VALUE),
          (Select 3 as BAG, 104 AS VALUE)
         ORDER BY BAG
         )
    GROUP BY BAG, VALUE
    )
GROUP BY BAG
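
The same per-bag logic can also be written with plain GROUP BY and a join instead of window functions. The sketch below is an assumption-laden illustration, not the answer's query: it runs on SQLite through Python's `sqlite3` module so it can be tried locally, the table name `bags` and the function name `exclusive_fraction_sql` are made up, and nothing here says how such a rewrite would perform on BigQuery.

```python
import sqlite3

# Sample rows (bag, value) from the question.
ROWS = [
    (1, 100), (1, 102), (1, 100),
    (2, 100), (2, 101), (2, 101),
    (3, 103), (3, 103), (3, 102), (3, 104),
]

# Portable SQL, no window functions:
#   1. reduce the table to distinct (bag, value) pairs
#   2. count how many bags contain each value
#   3. per bag, divide the number of values seen in one bag only
#      by the number of distinct values in the bag
QUERY = """
WITH distinct_pairs AS (
    SELECT DISTINCT bag, value FROM bags
),
bags_per_value AS (
    SELECT value, COUNT(*) AS n_bags
    FROM distinct_pairs
    GROUP BY value
)
SELECT p.bag,
       1.0 * SUM(CASE WHEN b.n_bags = 1 THEN 1 ELSE 0 END) / COUNT(*)
           AS exclusive_fraction
FROM distinct_pairs AS p
JOIN bags_per_value AS b ON b.value = p.value
GROUP BY p.bag
ORDER BY p.bag
"""


def exclusive_fraction_sql(rows):
    """Load rows into an in-memory SQLite table and run QUERY on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE bags (bag INTEGER, value INTEGER)")
        conn.executemany("INSERT INTO bags VALUES (?, ?)", rows)
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    for bag, fraction in exclusive_fraction_sql(ROWS).items():
        print(bag, round(fraction, 6))
```

Nothing in the query names a particular bag, so it stays a single statement however many bags the table holds.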

Try this:

select max_bag, sum(exclusive) / count(*) from
(select value, max(bag) as max_bag, if(min(bag) == max(bag), 1, 0) as exclusive
from 
  (Select 0 as BAG, 0 AS VALUE),
  (Select 1 as BAG, 100 AS VALUE),
  (Select 1 as BAG, 102 AS VALUE),
  (Select 1 as BAG, 100 AS VALUE),
  (Select 2 as BAG, 100 AS VALUE),
  (Select 2 as BAG, 101 AS VALUE),
  (Select 2 as BAG, 101 AS VALUE),
  (Select 3 as BAG, 103 AS VALUE),
  (Select 3 as BAG, 103 AS VALUE),
  (Select 3 as BAG, 102 AS VALUE),
  (Select 3 as BAG, 104 AS VALUE)
group by value
)
group by max_bag
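The only issue is that this omits results whose distribution equals zero (bag 1 in this case). I expect it to process your data in a few seconds (you may need to use GROUP EACH BY).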

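OK, this seems to be a workable solution, and it is also very well optimized. I haven't run all the tests yet, so I won't mark it as the accepted answer, but from the tests we did it appears to be correct.

Edit: I ran some tests, and this is the fastest solution for 8 million items and 500 bags.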

Not yet. It still doesn't meet the OP's requirements.

Although your answer (and the one below) is technically correct, I don't know whether I can call it an answer: on 8,000,000 rows with 2,000 bags and 800,000 distinct values (about 280 MB of data in total), my query is still executing after 10 minutes... I hope they won't bill me as if I were processing terabytes :)

Wow, 30 minutes and still executing! I'll try to get support to cancel the query!

Did you try GROUP EACH BY instead of GROUP BY?

No need to panic: if your table size is ±280 MB, that is the maximum amount of processed data you will be billed for :)

@RadekMichna That is great news. Can you provide a source for this?

This actually works! Can you confirm that it gives the percentage of distinct values only?

Uh, I was happy too soon. It does not work for either the minimum or the maximum bag number. (Select 0 as BAG, 0 AS VALUE), (Select 1 as BAG, 100 AS VALUE), (Select 1 as BAG, 102 AS VALUE), (Select 1 as BAG, 100 AS VALUE), (Select 2 as BAG, 100 AS VALUE), (Select 2 as BAG, 101 AS VALUE), (Select 2 as BAG, 101 AS VALUE), (Select 3 as BAG, 108 AS VALUE), (Select 3 as BAG, 101 AS VALUE), (Select 3 as BAG, 101 AS VALUE), (Select 0 as BAG, 103 AS VALUE), (Select 0 as BAG, 103 AS VALUE), (Select 0 as BAG, 102 AS VALUE), (Select 0 as BAG, 104 AS VALUE), (Select 4 as BAG, 203 AS VALUE), (Select 4 as BAG, 203 AS VALUE), (Select 4 as BAG, 202 AS VALUE), (Select 4 as BAG, 204 AS VALUE)

@lord.fist Yes, you are right... there is a problem. I will try to update it with a correction, since with the data above bag 0 should not have 100% exclusivity, because value 102 also exists in bag 1...
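
The problem reported in the last two comments can be reproduced without BigQuery. The sketch below is plain Python rather than SQL: `min_max_query` is a made-up helper that mimics the two GROUP BY steps of the min/max answer, `reference` applies the definition from the question, and the rows are the test data from the comment thread.

```python
from collections import defaultdict

# Test rows (bag, value) posted in the comments.
ROWS = [
    (0, 0), (1, 100), (1, 102), (1, 100),
    (2, 100), (2, 101), (2, 101),
    (3, 108), (3, 101), (3, 101),
    (0, 103), (0, 103), (0, 102), (0, 104),
    (4, 203), (4, 203), (4, 202), (4, 204),
]


def bags_by_value(rows):
    """Map each value to the set of bags that contain it."""
    mapping = defaultdict(set)
    for bag, value in rows:
        mapping[value].add(bag)
    return mapping


def reference(rows):
    """Definition from the question: exclusive distinct values / distinct values."""
    per_value = bags_by_value(rows)
    values_per_bag = defaultdict(set)
    for bag, value in rows:
        values_per_bag[bag].add(value)
    return {
        bag: sum(len(per_value[v]) == 1 for v in values) / len(values)
        for bag, values in values_per_bag.items()
    }


def min_max_query(rows):
    """Mimic: GROUP BY value -> (max_bag, exclusive), then GROUP BY max_bag."""
    exclusive_sum = defaultdict(int)
    value_count = defaultdict(int)
    for bags in bags_by_value(rows).values():
        max_bag = max(bags)                      # each value is credited to one bag only
        exclusive_sum[max_bag] += int(min(bags) == max_bag)
        value_count[max_bag] += 1
    return {bag: exclusive_sum[bag] / value_count[bag] for bag in value_count}


if __name__ == "__main__":
    expected, actual = reference(ROWS), min_max_query(ROWS)
    for bag in sorted(expected):
        print(bag, expected[bag], actual.get(bag))
```

Value 102 occurs in bags 0 and 1, so the grouping credits it to bag 1 alone. Bag 0 is then left with only its three exclusive values and is reported as 100% exclusive, while the definition gives 3 of 4 distinct values, or 0.75.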