Language agnostic: Combining Bloom filters

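I am using Bloom filters to check for duplicated data in a set. However, I need to combine the results of two sets of data into a single filter, so that I can check for duplication across the two sets. I devised a function in pseudo-Python to perform this task: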

def combine(a: bloom_filter, b: bloom_filter) -> bloom_filter:
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = bloom_filter(length=a.length, hashes=b.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits

    # Determining the amount of items
    # (bits are assumed to be non-negative Python ints of `length` bits)
    mask = (1 << a.length) - 1
    a_and_b = bin(a.bits & b.bits).count("1")
    a_not_b = bin(a.bits & ~b.bits & mask).count("1")
    not_a_b = bin(~a.bits & b.bits & mask).count("1")
    neither = bin(~(a.bits | b.bits) & mask).count("1")  # unused below
    c.item_count = (a_not_b / a.length * a.item_count
                    + not_a_b / b.length * b.item_count
                    + a_and_b / c.length * min(a.item_count, b.item_count))

    return c

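Does this sound right? I am having considerable internal debate over whether what I intend is even possible, since much of the information about the source data is lost (which is the point of a Bloom filter).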

You can derive a formula for estimating the number of items in a Bloom filter:

c = log(z / N) / (h * log(1 - 1 / N))

N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector

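This gives a fairly accurate estimate of the number of items in the Bloom filter. You can then estimate each filter's contribution with simple subtraction.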
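The estimate and the subtraction step might be sketched in Python as follows. This is only an illustration, assuming each filter's bit vector is held in a Python int; the names `estimate_items`, `count_zero_bits`, and `estimate_contributions` are made up for this example and do not come from any library.

```python
import math

def estimate_items(n_bits: int, n_hashes: int, zero_bits: int) -> float:
    """c = log(z / N) / (h * log(1 - 1 / N))."""
    if zero_bits == 0:
        # With no zero bits left, log(0) is undefined: the filter is saturated.
        raise ValueError("filter is saturated; the estimate is unbounded")
    return math.log(zero_bits / n_bits) / (n_hashes * math.log(1 - 1 / n_bits))

def count_zero_bits(bits: int, n_bits: int) -> int:
    """Number of unset bits in an n_bits-wide bit vector stored as an int."""
    return n_bits - bin(bits).count("1")

def estimate_contributions(a_bits: int, b_bits: int, n_bits: int, n_hashes: int) -> dict:
    """Estimate item counts for A, B, and their union (bitwise OR),
    then subtract to get what each filter adds beyond the other."""
    est_a = estimate_items(n_bits, n_hashes, count_zero_bits(a_bits, n_bits))
    est_b = estimate_items(n_bits, n_hashes, count_zero_bits(b_bits, n_bits))
    est_union = estimate_items(n_bits, n_hashes, count_zero_bits(a_bits | b_bits, n_bits))
    return {
        "a": est_a,
        "b": est_b,
        "union": est_union,
        "only_in_a": est_union - est_b,
        "only_in_b": est_union - est_a,
    }
```

The estimate degrades as the filter fills up, and it is undefined once every bit is set, which is why the sketch raises an error in that case.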

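It is possible... sort of.

Let's assume set A contains apples and oranges

Assume set B contains peas and carrots

Construct a simple 16-bit Bloom filter as an example, with CRC32 as the hash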

crc32(apples) = 0x70CCB02F

crc32(oranges) = 0x45CDF3B4

crc32(peas) = 0xB18D0C2B

crc32(carrots) = 0x676A9E28

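Start with an empty Bloom filter (BF) of, say, 16 bits for each of the two sets (A, B)

Then break the hash up into chunks of some bit length; we will use 4 bits here. Each 4-bit value is the index of a bit to set in the BF. We can add apples to the BF this way, and then oranges.

So now apples and oranges are inserted into BFA, with the final value

1011 1000 1011 1101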
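The insertion step can be sketched in Python like this. It is a minimal illustration that takes the CRC32 values as quoted in this answer rather than recomputing them, treats bit 0 as the rightmost bit, and uses made-up helper names (`nibble_indexes`, `item_mask`, `fmt`).

```python
def nibble_indexes(hash32: int) -> list:
    """Split a 32-bit hash into eight 4-bit values, most significant nibble first."""
    return [(hash32 >> shift) & 0xF for shift in range(28, -1, -4)]

def item_mask(hash32: int) -> int:
    """16-bit pattern with one bit set per nibble value (bit 0 is the rightmost bit)."""
    mask = 0
    for index in nibble_indexes(hash32):
        mask |= 1 << index
    return mask

def fmt(bits: int) -> str:
    """Render a 16-bit value as four space-separated groups of four bits."""
    s = format(bits, "016b")
    return " ".join(s[i:i + 4] for i in range(0, 16, 4))

APPLES = 0x70CCB02F   # crc32(apples), as given in the text
ORANGES = 0x45CDF3B4  # crc32(oranges), as given in the text

bfa = 0
for h in (APPLES, ORANGES):
    bfa |= item_mask(h)  # inserting an item is a bitwise OR

print(nibble_indexes(ORANGES))  # [4, 5, 12, 13, 15, 3, 11, 4]
print(fmt(item_mask(APPLES)))   # 1001 1000 1000 0101
print(fmt(item_mask(ORANGES)))  # 1011 1000 0011 1000
print(fmt(bfa))                 # 1011 1000 1011 1101
```

Repeated nibble values, such as the two 4s in the oranges hash, simply set the same bit twice.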

Do the same for BFB

crc32(peas) = 0xB18D0C2B becomes =>
set [11,2,12,0,13,1,8] in BFB
 0011 1001 0000 0111 = BF(peas)

crc32(carrots) = 0x676A9E28 becomes =>
set [8,2,14,9,10,6,7] in BFB

0100 0111 1100 0100 = BF(carrots)

so BFB =
0011 1001 0000 0111  BF(peas)
0100 0111 1100 0100  BF(carrots)
===================  ('add' them to BFB via logical OR op)
0111 1111 1100 0111

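You can now check each of A's items against B in a loop, and vice versa:

Does B contain "oranges"? =>

Because the result of ANDing the oranges BF with BFB

(0011 1000 0000 0000)

does not match the original BF of oranges (1011 1000 0011 1000), you can be certain that B does not contain oranges

... (do the same for the rest of the items)

It follows that B does not contain any of A's items, just as B does not contain apples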
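The membership check just described might look like this in Python. It is a sketch using the bit patterns from this example, with the hypothetical helper name `might_contain`.

```python
BFB = 0b0111_1111_1100_0111      # peas | carrots
ORANGES = 0b1011_1000_0011_1000  # BF representation of oranges
APPLES = 0b1001_1000_1000_0101   # BF representation of apples
CARROTS = 0b0100_0111_1100_0100  # BF representation of carrots

def might_contain(bf: int, item_bits: int) -> bool:
    """False means the item is definitely absent from the filter;
    True means only that it may be present (false positives are possible)."""
    return (bf & item_bits) == item_bits

# Loop over A's items and test each one against B's filter.
for name, bits in (("oranges", ORANGES), ("apples", APPLES)):
    print(name, might_contain(BFB, bits))
# oranges False
# apples False
```

An item that was actually inserted, such as carrots, always passes the check: `might_contain(BFB, CARROTS)` is `True`.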

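I don't think that is what you asked for, though. It looks like you could compute a difference BF, which is closer to your point. It seems you could do an XOR operation, which gives you a "single" array containing both differences: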

0111 1111 1100 0111 (BFB)
1011 1000 1011 1101 (BFA)
========================
1100 0111 0111 1010 (BFA xor BFB) == (items in B not in A, and items in A not in B)

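That is, with this single BF you can detect the non-existence of an item 100% of the time, but you cannot confirm the existence of an item 100% of the time

The way you would use it is as follows (to check whether peas are "missing from A"):

Since

((BFA xor BFB) & Peas) != 0

you know that one of the sets does not contain "peas"

Again, you would need to test item by item; maybe you could do some aggregation, but that is probably not a good idea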
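A small Python sketch of the XOR check, under the same assumptions as the worked example (16-bit filters, one bit per 4-bit chunk of the hash, bit 0 as the rightmost bit). The helper names `item_mask` and `missing_from_one_set` are invented for this illustration.

```python
BFA = 0b1011_1000_1011_1101  # apples | oranges
BFB = 0b0111_1111_1100_0111  # peas | carrots

def item_mask(hash32: int) -> int:
    """16-bit pattern with one bit set per 4-bit chunk of the hash."""
    mask = 0
    for shift in range(0, 32, 4):
        mask |= 1 << ((hash32 >> shift) & 0xF)
    return mask

def missing_from_one_set(bf_a: int, bf_b: int, item_bits: int) -> bool:
    """True if some bit of the item is set in exactly one filter, so the item
    is definitely absent from at least one set. False is inconclusive."""
    return ((bf_a ^ bf_b) & item_bits) != 0

PEAS = item_mask(0xB18D0C2B)  # crc32(peas), as given in the text

print(format(BFA ^ BFB, "016b"))                 # 1100011101111010
print(missing_from_one_set(BFA, BFB, PEAS))      # True
```

The XOR filter alone does not say which set lacks the item; testing the item against BFA and BFB separately answers that.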

Hope this helps

0x45CDF3B4 = 0100 0101 1100 1101 1111 0011 1011 0100
              4    5    12   13   15    3   11   4
----------------------------------------------------
Add oranges to BF by setting BF bit indexes [4,5,12,13,15,3,11,4]

Oranges =      1011 1000 0011 1000
BFA =          1001 1000 1000 0101  (apples only; OR operation)
================================
Updated BFA =  1011 1000 1011 1101

 1011 1000 0011 1000 (Oranges BF representation)
 0111 1111 1100 0111 (BFB)
=====================     (AND operation)
 0011 1000 0000 0000

 1100 0111 0111 1010 (BFA xor BFB)
 0011 1001 0000 0111 (Peas)
============================== (AND operation)
 0000 0001 0000 0010 (non-zero)