Hashing a large data set and implementing it in C

c gcc data-structures hash

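I have a large number of values ranging from 0 to 5463458053. To each value I want to map a set of strings, so that the lookup operation, i.e. finding whether a string is present in that set, takes as little time as possible. Note that this set of values may not contain every value in (0-5463458053), but it does contain a large number of them.
My current solution is to hash these values (between 0 and 5463458053) and keep, for each value, a linked list of the strings that belong to it. Every time I want to check for a string in a given set, I hash the value (between 0 and 5463458053), get the linked list, and traverse it to find out whether it contains the string.
While this seems easy enough, it is a little time consuming. Can you think of a faster solution? Also, collisions would be dreadful: they would lead to wrong results.
The other part is about implementing this in C. How would I go about doing that?
Note: someone suggested using a database instead. I wonder whether that would be useful.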

Naturally, I'm a little worried about running out of memory. :-)
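If the entries run from 0 to N and are consecutive: use an array. (Would indexing be fast enough for you?)
EDIT: the numbers do not appear to be consecutive. There is a large number of {key, value} pairs, where the key is a big number (more than 32 bits but fewer than 64) and the value is a set of strings.
If memory is available, a hash table is easy; if the sets of strings are not too large, you can check them sequentially. If the same strings occur more than once (many times), you could enumerate the strings: put pointers to them in a char * array[] and store the index into that array instead. Finding the index of a given string would probably involve another hash table.
For the "master" hash table, an entry would probably be: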

struct entry {
    struct entry *next;        /* for overflow chain */
    unsigned long long key;    /* the 33-bit number */
    struct list *payload;
} entries[big_enough_for_all]; /* if the size is known in advance, preallocation avoids a lot of malloc overhead */
If you have enough memory for a separate heads array, you can of course do:

struct entry *heads[SOME_SIZE] = {NULL, };
Otherwise, you can combine the heads array with the entries array (as I did here).

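Handling collisions is easy: as you walk the overflow chain, just compare your key with the key stored in the entry. If they are unequal: keep walking. If they are equal: found; now go walk the strings.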
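The chain walk described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual code: the names `NUM_BUCKETS`, `bucket_of`, `find_entry`, `set_contains` and `set_add` are hypothetical, the payload is assumed to be a plain linked list of strings, and the bit-mixing function is just one reasonable choice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 1024u   /* hypothetical size; must be a power of two */

struct str_node {           /* one string in a key's set */
    struct str_node *next;
    char *str;
};

struct entry {
    struct entry *next;     /* overflow chain */
    uint64_t key;           /* the 33-bit number */
    struct str_node *payload;
};

static struct entry *heads[NUM_BUCKETS];

/* Mix the key bits, then mask down to a bucket index. */
size_t bucket_of(uint64_t key)
{
    key ^= key >> 33;
    key *= UINT64_C(0xff51afd7ed558ccd);
    key ^= key >> 33;
    return (size_t)(key & (NUM_BUCKETS - 1));
}

/* Walk the overflow chain; comparing the stored key is what keeps
   two keys that share a bucket from being confused. */
struct entry *find_entry(uint64_t key)
{
    struct entry *e;
    for (e = heads[bucket_of(key)]; e != NULL; e = e->next) {
        if (e->key == key)
            return e;
    }
    return NULL;
}

/* Return 1 if str is in the set mapped to key, 0 otherwise. */
int set_contains(uint64_t key, const char *str)
{
    struct entry *e = find_entry(key);
    struct str_node *s;
    if (e == NULL)
        return 0;
    for (s = e->payload; s != NULL; s = s->next) {
        if (strcmp(s->str, str) == 0)
            return 1;
    }
    return 0;
}

/* Add str to the set mapped to key.
   Returns 0 on success (or if already present), -1 if malloc fails. */
int set_add(uint64_t key, const char *str)
{
    struct entry *e;
    struct str_node *s;
    size_t len;

    assert(str != NULL);
    if (set_contains(key, str))
        return 0;
    e = find_entry(key);
    if (e == NULL) {
        size_t b = bucket_of(key);
        e = malloc(sizeof *e);
        if (e == NULL)
            return -1;
        e->key = key;
        e->payload = NULL;
        e->next = heads[b];     /* push onto the bucket's chain */
        heads[b] = e;
    }
    s = malloc(sizeof *s);
    if (s == NULL)
        return -1;
    len = strlen(str) + 1;
    s->str = malloc(len);
    if (s->str == NULL) {
        free(s);
        return -1;
    }
    memcpy(s->str, str, len);
    s->next = e->payload;       /* push onto the key's string list */
    e->payload = s;
    return 0;
}
```

After `set_add(5463458053ULL, "alpha")`, a call to `set_contains(5463458053ULL, "alpha")` returns 1, while the same string under a different key returns 0. Because the full key is stored and compared, a bucket collision only costs extra steps along the chain; it cannot produce a wrong answer.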
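You could have a hash table of hash sets. The keys of the first hash table are your integers. The values inside it are hash sets, i.e. hash tables whose keys are strings.
You could also have a single hash set whose keys are pairs of an integer and a string.
There are many libraries implementing such data structures. In C++ the standard library provides them: std::map and std::set are ordered, tree-based containers, and std::unordered_map and std::unordered_set are the hash-based ones. For C, GLib from GTK comes to mind.

With hashing techniques, memory usage is proportional to the size of the sets (or relations) being stored. For instance, you could accept a 30% emptiness rate.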
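The second option, a single hash set keyed by (integer, string) pairs, might look something like the sketch below. It is a minimal illustration, not a library API: `CAPACITY`, `MAX_USED`, `hash_pair`, `probe`, `pair_contains` and `pair_add` are made-up names, the table has a fixed size, and FNV-1a with linear probing is simply one workable combination. The 70% fill limit is meant to mirror the idea of keeping about 30% of the slots empty.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CAPACITY 1024u                   /* hypothetical; power of two */
#define MAX_USED (CAPACITY * 7u / 10u)   /* keep about 30% of slots empty */

struct pair {
    uint64_t num;   /* integer part of the key */
    char *str;      /* string part; NULL marks an empty slot */
};

static struct pair slots[CAPACITY];
static size_t used;

/* FNV-1a over the 8 bytes of num followed by the bytes of str. */
uint64_t hash_pair(uint64_t num, const char *str)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    int i;
    for (i = 0; i < 8; i++) {
        h ^= (num >> (8 * i)) & 0xff;
        h *= UINT64_C(0x100000001b3);
    }
    for (; *str != '\0'; str++) {
        h ^= (unsigned char)*str;
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Linear probing: return the slot holding the pair, or the empty
   slot where it would be stored. Terminates because the table is
   never allowed to fill up completely. */
size_t probe(uint64_t num, const char *str)
{
    size_t i = (size_t)(hash_pair(num, str) & (CAPACITY - 1));
    while (slots[i].str != NULL &&
           !(slots[i].num == num && strcmp(slots[i].str, str) == 0)) {
        i = (i + 1) & (CAPACITY - 1);
    }
    return i;
}

/* Return 1 if the pair (num, str) is in the set, 0 otherwise. */
int pair_contains(uint64_t num, const char *str)
{
    return slots[probe(num, str)].str != NULL;
}

/* Insert (num, str). Returns 0 on success or if already present,
   -1 if the fill limit is reached or malloc fails. */
int pair_add(uint64_t num, const char *str)
{
    size_t i, len;

    assert(str != NULL);
    i = probe(num, str);
    if (slots[i].str != NULL)
        return 0;               /* already present */
    if (used >= MAX_USED)
        return -1;              /* would exceed the fill limit */
    len = strlen(str) + 1;
    slots[i].str = malloc(len);
    if (slots[i].str == NULL)
        return -1;
    memcpy(slots[i].str, str, len);
    slots[i].num = num;
    used++;
    return 0;
}
```

With this layout a membership test is one hash computation plus a short probe sequence, and there is no per-integer list to walk. The trade-off is that enumerating all strings that belong to one integer is no longer cheap, which the two-level design (a hash table of hash sets) handles better.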
A Judy array, together with the C library that implements it, may be exactly the basis you need. Here is a quote describing it:
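Judy is a C library that provides a state-of-the-art core technology that implements a sparse dynamic array. Judy arrays are declared simply with a null pointer. A Judy array consumes memory only when it is populated, yet can grow to take advantage of all available memory if desired. Judy's key benefits are scalability, high performance, and memory efficiency. A Judy array is extensible and can scale up to a very large number of elements, bounded only by machine memory. Since Judy is designed as an unbounded array, the size of a Judy array is not pre-allocated but grows and shrinks dynamically with the array population. Judy combines scalability with ease of use. The Judy API is accessed with simple insert, retrieve, and delete calls that do not require extensive programming. Tuning and configuring are not required (in fact not even possible). In addition, sort, search, count, and sequential access capabilities are built into Judy's design.
Judy can be used whenever a developer needs dynamically sized arrays, associative arrays or a simple-to-use interface that requires no rework for expansion or con...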
function iterate_all_sets () {
    node = first_node_in_tree();
    while (node != null) {
        current_set = node.set_number;
        do_something_with(current_set);
        if (cannot increment current_set) {
            return;
        }
        node = lookup_tree_weak(current_set + 1, "");
        if (node.set_number == current_set) {
            node = successor(node);
        }
    }
}
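The pseudocode above depends on keys being ordered by set number first and by string second, so that a lookup for (set_number + 1, "") lands at the start of the next set. The same jump can be tried out without a tree library by using a sorted array and a binary search in place of the weak tree lookup. This is a swapped-in stand-in for illustration only; `value_cmp`, `lower_bound`, `count_sets` and `count_strings_in_set` are hypothetical names, and the array is assumed to be sorted already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct value {
    uint32_t set_no;
    const char *str;
};

/* Order by set number first, then by string. */
int value_cmp(const struct value *a, const struct value *b)
{
    int c;
    if (a->set_no != b->set_no)
        return a->set_no < b->set_no ? -1 : 1;
    c = strcmp(a->str, b->str);
    return (c > 0) - (c < 0);
}

/* Index of the first element >= key in the sorted array v,
   or n if every element is smaller. */
size_t lower_bound(const struct value *v, size_t n, const struct value *key)
{
    size_t lo = 0, hi = n;
    assert(v != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (value_cmp(&v[mid], key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Visit each distinct set once by jumping to (set_no + 1, ""),
   which sorts before every string of the next existing set. */
size_t count_sets(const struct value *v, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        uint32_t current = v[i].set_no;
        struct value next;
        count++;                    /* "do_something_with(current_set)" */
        if (current == UINT32_MAX)
            break;                  /* cannot increment current_set */
        next.set_no = current + 1;
        next.str = "";
        i = lower_bound(v, n, &next);
    }
    return count;
}

/* Count the strings stored under one set number. */
size_t count_strings_in_set(const struct value *v, size_t n, uint32_t set_no)
{
    struct value first;
    size_t i, count = 0;
    first.set_no = set_no;
    first.str = "";                 /* smallest possible string */
    i = lower_bound(v, n, &first);
    while (i < n && v[i].set_no == set_no) {
        count++;
        i++;
    }
    return count;
}
```

Unlike the weak tree lookup, `lower_bound` never returns an element smaller than the key, so the extra "step to the successor" check from the pseudocode is not needed here. A sorted array only suits data that is built once and then queried; for frequent inserts and removals a balanced tree remains the better fit.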

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <assert.h>

#include <structure/BAVL.h>
#include <misc/offset.h>

struct value {
    uint32_t set_no;
    char str[3];
};

struct node {
    uint8_t is_used;
    struct value val;
    BAVLNode tree_node;
};

BAVL tree;

// order values by set number first, then by string
static int value_comparator (void *unused, void *vv1, void *vv2)
{
    struct value *v1 = vv1;
    struct value *v2 = vv2;

    if (v1->set_no < v2->set_no) {
        return -1;
    }
    if (v1->set_no > v2->set_no) {
        return 1;
    }

    int c = strcmp(v1->str, v2->str);
    if (c < 0) {
        return -1;
    }
    if (c > 0) {
        return 1;
    }
    return 0;
}

static void random_bytes (unsigned char *out, size_t n)
{
    while (n > 0) {
        *out = rand();
        out++;
        n--;
    }
}

// fill in a random set number and a random printable string
static void random_value (struct value *out)
{
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));

    for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
        out->str[i] = (uint8_t)32 + (rand() % 94);
    }
    out->str[sizeof(out->str) - 1] = '\0';
}

static struct node * find_node (const struct value *val)
{
    // find AVL tree node with an equal value
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
    if (!tn) {
        return NULL;
    }

    // get node pointer from pointer to its value (same as container_of() in Linux kernel)
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node);

    assert(n->val.set_no == val->set_no);
    assert(!strcmp(n->val.str, val->str));

    return n;
}

// weak lookup: returns the node with an equal value, or else a neighbouring node
static struct node * lookup_weak (const struct value *v)
{
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * first_node (void)
{
    BAVLNode *tn = BAVL_GetFirst(&tree);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * next_node (struct node *node)
{
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

size_t num_found;

static void iterate_all_strings_in_set (uint32_t set_no)
{
    // start from (set_no, ""), the smallest possible value in this set
    struct value v;
    v.set_no = set_no;
    v.str[0] = '\0';

    struct node *n = lookup_weak(&v);
    if (!n) {
        return;
    }

    if (n->val.set_no != set_no) {
        n = next_node(n);
    }

    while (n && n->val.set_no == set_no) {
        num_found++; // "do_something_with_string"
        n = next_node(n);
    }
}

static void iterate_all_sets (void)
{
    struct node *node = first_node();

    while (node) {
        uint32_t current_set = node->val.set_no;
        iterate_all_strings_in_set(current_set); // "do_something_with_set"

        if (current_set == UINT32_MAX) {
            return;
        }

        // jump to the first node of the next existing set
        struct value v;
        v.set_no = current_set + 1;
        v.str[0] = '\0';

        node = lookup_weak(&v);
        if (node->val.set_no == current_set) {
            node = next_node(node);
        }
    }
}

int main (int argc, char *argv[])
{
    size_t num_nodes = 10000000;

    // init AVL tree, using:
    //   key=(struct node).val,
    //   comparator=value_comparator
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);

    printf("Allocating...\n");

    // allocate nodes (missing overflow check...)
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
    if (!nodes) {
        printf("malloc failed!\n");
        return 1;
    }

    printf("Inserting %zu nodes...\n", num_nodes);

    size_t num_inserted = 0;

    // insert nodes, giving them random values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];

        // choose random set number and string
        random_value(&n->val);

        // try inserting into AVL tree
        if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
            printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
            n->is_used = 0;
            continue;
        }

        n->is_used = 1;
        num_inserted++;
    }

    printf("Looking up...\n");

    // lookup all those values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        struct node *lookup_n = find_node(&n->val);

        if (n->is_used) {
            // this node is the only one with this value
            assert(lookup_n == n);
        } else {
            // this node was an insert collision; some other
            // node must have this value
            assert(lookup_n != NULL);
            assert(lookup_n != n);
        }
    }

    printf("Iterating by sets...\n");

    num_found = 0;
    iterate_all_sets();
    assert(num_found == num_inserted);

    printf("Removing all strings...\n");

    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];

        if (!n->is_used) {
            // must not remove it if it wasn't inserted
            continue;
        }

        BAVL_Remove(&tree, &n->tree_node);
    }

    free(nodes);

    return 0;
}