Creating a minimal perfect hash for sparse 64-bit unsigned integers in Python


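I need a 64-bit to 16-bit perfect hash function for a sparsely populated list of keys.

I have a Python dictionary with 48326 keys, each 64 bits long. I would like to create a minimal perfect hash for this list of keys. (I don't want to wait days for the MPH to be computed, so I am fine with it mapping to a 16-bit hash as well.)

The goal is to eventually port this dictionary to C as an array that contains the dict values, where the index is computed by the minimal perfect hash function taking the key as input. I cannot use external hashing libraries in the C port of the application I am building.

Question: Is there any Python library that takes my keys as input and gives me the hashing parameters (based on a defined hashing algorithm) as output?

I found a library, but because my keys are 64 bits wide, it just hung. (Even when I tested it on a subset of 2000 keys.)

Edit: As suggested in the comments, I looked at the suggested code and modified the hash function to take 64-bit integers (changing the values of the FNV prime and offset accordingly).

Although I got a result, the mapping unfortunately returned negative index values. I can make that work, but it means adding another 4 cycles to the hash computation to check for a negative index.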

To avoid this, I would personally just generate the table with a dedicated perfect-hash generator, including for a large number of keys.

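If you must do this in Python, I found some Python 2 code that is very efficient at turning string keys into a minimal perfect hash using an intermediate table.

Adapting the code from that post to your requirements produces a minimal perfect hash for 50k items in about 0.35 seconds: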

>>> import random
>>> testdata = {random.randrange(2**64): random.randrange(2**64)
...             for __ in range(50000)}  # 50k random 64-bit keys
>>> import timeit
>>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import  gen_minimal_perfect_hash, testdata', number=10)
3.461486832005903

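The changes I made:

I switched to Python 3, followed the Python style guide and made the code more Pythonic
I am using `int.to_bytes()` to turn the 64-bit integer keys into bytes for hashing
Instead of storing negative numbers, I use a flag to distinguish between direct and hash-input values in the intermediate table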
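
The last change can be sketched as follows. This is only an illustration, assuming the negative-number scheme encodes a direct slot as `-slot - 1`; the helper names are hypothetical and not part of the adapted code.

```python
# Two ways to mark an intermediate-table entry as a "direct" slot reference.
# Assumption: the negative scheme stores a direct slot as -slot - 1.

def encode_negative(slot):
    """Direct slot reference encoded as a negative number."""
    return -slot - 1


def decode_negative(entry):
    """Return (is_direct, d) from a signed entry."""
    if entry < 0:
        return True, -entry - 1
    return False, entry


def encode_flagged(slot):
    """Direct slot reference stored as an explicit (flag, d) tuple."""
    return (True, slot)


# Both encodings describe the same direct slot...
assert decode_negative(encode_negative(7)) == (True, 7)
assert encode_flagged(7) == (True, 7)
# ...but only the signed form needs values below zero.
assert encode_negative(0) == -1
```

With the flagged form `d` is never negative, so it can be stored in an unsigned C type and the lookup tests a boolean instead of a sign.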

The adapted code:

# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
# Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
from itertools import count, groupby


def fnv_hash_int(value, size, d=0x811c9dc5):
    """Calculates a distinct hash function for a given 64-bit integer.

    Each value of the integer d results in a different hash value. The return
    value is the modulus of the hash and size.

    """
    # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
    # The unsigned integer is first converted to a 8-character byte string.
    for c in value.to_bytes(8, 'big'):
        d = ((d * 0x01000193) ^ c) & 0xffffffff

    return d % size


def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
    """Computes a minimal perfect hash table using the given Python dictionary.

    It returns a tuple (intermediate, values). intermediate and values are both
    lists. intermediate contains the intermediate table of indices needed to
    compute the index of the value in values; a tuple of (flag, d) is stored, where
    d is either a direct index, or the input for another call to the hash function.
    values contains the values of the dictionary.

    """
    size = len(dictionary)

    # Step 1: Place all of the keys into buckets
    buckets = [[] for __ in dictionary]
    intermediate = [(False, 0)] * size
    values = [None] * size

    for key in dictionary:
        buckets[_hash_func(key, size)].append(key)

    # Step 2: Sort the buckets and process the ones with the most items first.
    buckets.sort(key=len, reverse=True)
    # Only look at buckets of length greater than 1 first; partitioned produces
    # groups of buckets of lengths > 1, then those of length 1, then the empty
    # buckets (we ignore the last group).
    partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
    for bucket in next(partitioned, ()):
        # Try increasing values of d until we find a hash function
        # that places all items in this bucket into free slots
        for d in count(1):
            slots = {}
            for key in bucket:
                slot = _hash_func(key, size, d=d)
                if values[slot] is not None or slot in slots:
                    break
                slots[slot] = dictionary[key]
            else:
                # all slots filled, update the values table; False indicates
                # these values are inputs into the hash function
                intermediate[_hash_func(bucket[0], size)] = (False, d)
                for slot, value in slots.items():
                    values[slot] = value
                break

    # The next group is buckets with only 1 item. Process them more quickly by
    # directly placing them into a free slot.
    freelist = (i for i, value in enumerate(values) if value is None)

    for bucket, slot in zip(next(partitioned, ()), freelist):
        # These are 'direct' slot references
        intermediate[_hash_func(bucket[0], size)] = (True, slot)
        values[slot] = dictionary[bucket[0]]

    return (intermediate, values)


def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
    "Look up a value in the hash table defined by intermediate and values"
    direct, d = intermediate[_hash_func(key, len(intermediate))]
    return values[d if direct else _hash_func(key, len(values), d=d)]

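The above produces two lists with 50k entries each; the values in the first table are `(boolean, integer)` tuples, with the integers in the range `[0, tablesize)` (in theory the values could range up to 2^16, but I would be very surprised if it ever took 65k+ attempts to find a slot arrangement for your data). Your table size is below 50k, so when this list is expressed as a C array, each entry can be stored in 4 bytes (`bool` and `short` make 3, plus one byte of padding for alignment).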
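
The 4-byte figure can be checked from Python with the `struct` module. This is a rough sketch that assumes a typical platform where a 16-bit integer is aligned on a 2-byte boundary; the variable names are illustrative only.

```python
import struct

# Native alignment ('@'): a bool followed by an unsigned 16-bit integer.
# The 16-bit field has to start on a 2-byte boundary, so the compiler-style
# layout inserts one padding byte after the bool.
native_size = struct.calcsize('@?H')

# Standard layout ('<') disables alignment: only the 3 payload bytes remain.
packed_size = struct.calcsize('<?H')

print(native_size, packed_size)
```

On common x86-64 and ARM builds this should report 4 bytes for the aligned layout and 3 for the packed one.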

A quick test to confirm that the hash tables are correct and produce the right output again:

>>> tables = gen_minimal_perfect_hash(testdata)
>>> for key, value in testdata.items():
...     assert perfect_hash_lookup(key, *tables) == value
...
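
The lookup test confirms that every value comes back; what makes the hash minimal is that the keys cover every slot in `range(size)` exactly once. The sketch below shows that property for a single bucket of five arbitrary keys. `find_displacement` is a hypothetical helper, not part of the adapted code, and the hash function is repeated only to keep the example self-contained.

```python
from itertools import count


def fnv_hash_int(value, size, d=0x811c9dc5):
    """Same 32-bit FNV-style hash as in the adapted code, modulo size."""
    for c in value.to_bytes(8, 'big'):
        d = ((d * 0x01000193) ^ c) & 0xffffffff
    return d % size


def find_displacement(keys, size):
    """Hypothetical helper: first d >= 1 that maps all keys to distinct slots."""
    for d in count(1):
        slots = {fnv_hash_int(key, size, d=d) for key in keys}
        if len(slots) == len(keys):
            return d


keys = [
    0x0123456789abcdef,
    0xfedcba9876543210,
    0x00000000deadbeef,
    0xcafebabe00000000,
    0xffffffffffffffff,
]
size = len(keys)
d = find_displacement(keys, size)

# Minimal and perfect: the slots are exactly 0 .. size - 1, no gaps, no repeats.
slots = sorted(fnv_hash_int(key, size, d=d) for key in keys)
assert slots == list(range(size))
```

The full generator does the same search per bucket, with the extra constraint that a slot already taken by an earlier bucket is not reused.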

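You only need to implement the lookup function in C:

The `fnv_hash_int` operation can take a pointer to your 64-bit integer, cast that pointer to an array of 8-bit values, and increment an index 8 times to access each separate byte; use a byte-order conversion to ensure big-endian (network) order, matching `to_bytes(8, 'big')` in the Python code.
You don't need to mask with `0xffffffff` in C, because overflow on a 32-bit unsigned C integer is discarded automatically.
`len(intermediate) == len(values) == len(dictionary)`, and this can be captured in a constant.
Assuming C99, store the intermediate table as an array of a structure type, with `flag` as a `bool` and `d` as an unsigned `short`; that is just 3 bytes, plus 1 padding byte so that `d` is aligned, giving 4 bytes per entry. The data type in the `values` array depends on the values in your input dictionary.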
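
The byte-order and overflow points can be checked from Python before porting. This is a small sketch; `ctypes.c_uint32` is used here only to emulate `uint32_t` wraparound and is not part of the lookup itself.

```python
import ctypes
import struct

key = 0x0123456789abcdef

# 'big' in int.to_bytes() is the same layout as network byte order ('>').
assert key.to_bytes(8, 'big') == struct.pack('>Q', key)
# A little-endian machine stores the bytes reversed, which is why the C side
# needs a byte swap before walking the key byte by byte.
assert struct.pack('<Q', key) == key.to_bytes(8, 'big')[::-1]

# Python integers never overflow, so the Python code masks explicitly;
# a 32-bit unsigned C integer wraps modulo 2**32 on its own.
d, prime = 0x811c9dc5, 0x01000193
masked = (d * prime) & 0xffffffff
wrapped = ctypes.c_uint32(d * prime).value  # emulates uint32_t arithmetic
assert masked == wrapped == (d * prime) % 2**32
```

If the C port hashes the bytes in a different order than Python did when building the tables, lookups will land on the wrong slots, so this is worth verifying on the target platform.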

If you'll forgive my C skills, here is a sample implementation:

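mph_table.h

#include "mph_generated_table.h"
#include <arpa/inet.h>  // htonl (POSIX)
#include <stdint.h>

#ifndef htonll
// see https://stackoverflow.com/q/3022552
#define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
#endif

uint64_t mph_lookup(uint64_t key);

mph_table.c

#include "mph_table.h"
#include <stdbool.h>
#include <stdint.h>

#define FNV_OFFSET 0x811c9dc5
#define FNV_PRIME 0x01000193

uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
    // d == 0 selects the default FNV offset, as in the Python version
    d = (d == 0) ? FNV_OFFSET : d;
    uint8_t* keybytes = (uint8_t*)&key;
    for (int i = 0; i < 8; ++i) {
        // uint32_t arithmetic wraps, so no explicit mask is needed
        d = (d * FNV_PRIME) ^ keybytes[i];
    }
    return d % TABLE_SIZE;
}

uint64_t mph_lookup(uint64_t key) {
    _intermediate_entry entry =
        mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
    return mph_tables.values[
        entry.flag ?
            entry.d :
            fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
}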

This relies on a generated header file, produced with:

from textwrap import indent

template = """\
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE %(size)s

typedef struct _intermediate_entry {
    bool flag;
    uint16_t d;
} _intermediate_entry;
typedef struct mph_tables_t {
    _intermediate_entry intermediate[TABLE_SIZE];
    uint64_t values[TABLE_SIZE];
} mph_tables_t;

static const mph_tables_t mph_tables = {
    {  // intermediate
%(intermediate)s
    },
    {  // values
%(values)s
    }
};
"""

tables = gen_minimal_perfect_hash(dictionary)
size = len(dictionary)
cbool = ['false, ', 'true,  ']
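def perline(entries, n=10):
    # Group entries into rows of n; a short final row is kept as well
    entries = list(entries)
    return (entries[i:i + n] for i in range(0, len(entries), n))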
entries = (f'{{{cbool[e[0]]}{e[1]:#06x}}}' for e in tables[0])
intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
entries = (format(v, '#018x') for v in tables[1])
values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)

with open('mph_generated_table.h', 'w') as generated:
    generated.write(template % locals())

where `dictionary` is your input table.
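
Before writing the header it may be worth checking that the tables actually fit the C types used in the template (`uint16_t` for `d`, `uint64_t` for the values). The function below is a hypothetical pre-flight check, not part of the generator, and the toy tables exist only to exercise it.

```python
def check_tables_fit(intermediate, values):
    """Hypothetical check that the tables fit the C types in the template."""
    if len(intermediate) != len(values):
        raise ValueError('tables must have the same length')
    for flag, d in intermediate:
        if not 0 <= d <= 0xFFFF:
            raise ValueError(f'd={d} does not fit in uint16_t')
    for v in values:
        if not 0 <= v <= 0xFFFFFFFFFFFFFFFF:
            raise ValueError(f'value {v} does not fit in uint64_t')
    return True


# Toy tables; real ones would come from gen_minimal_perfect_hash().
toy_intermediate = [(False, 1), (True, 2), (False, 0)]
toy_values = [10, 2**64 - 1, 30]
assert check_tables_fit(toy_intermediate, toy_values)
```

A table with more than 65536 entries, or a bucket that needed more than 65535 attempts, would fail this check and call for a wider type for `d`.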

Compiled with `gcc -O3`, the hash function is inlined (with the loop unrolled) and the whole `mph_lookup`