Is it worth implementing a slab allocator in C these days?
I'm working on a server that has to read from thousands of connected socket clients simultaneously. Client requests consist of messages that are all exactly the same size, about 32 bytes.

I've been reading about slab allocators, and I'd like to use that particular technique in my application when I call read to pull data off a socket (read copies data from a kernel buffer into a buffer of my choosing, and I'd like to use some dynamically allocated memory for that).

From what I've read, the Linux kernel already seems to use this technique. If it's already used to implement malloc or new, is it still worth doing myself, given that the allocation is already well handled?

I was also thinking it might be better to allocate on the stack rather than use the SLAB algorithm, but I'm not sure which approach is best.

If you're a C programmer, you should certainly get your hands dirty with memory management! That said, you probably won't run into any problems simply malloc'ing each request unless you're really pushing your machine's limits, which seems unlikely. But I believe it's better to know your options than to take someone else's word for it. Here are some ideas to consider.
Static array
The simplest alternative is a single global array of request slots, keeping track of which slots are in use. This puts a static limit on the number of requests, but on the other hand there's no overhead and no fragmentation to worry about. Just set the limit really high.

Here's an example implementation. If you're not familiar with bitwise operations it may look a little confusing, but the gist is that we keep an extra array containing one bit (on or off) per request slot, specifying whether that slot is in use. You could instead add an "is_used" variable to the struct itself, but that would pad the struct by considerably more than one bit, which works against our goal of minimizing overhead.
The header file is really small (this is a genuine virtue of C, by the way!):
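The header itself is missing from this copy of the answer; a minimal sketch of what it might contain is below. The file name request.h and the payload layout (a plain 32-byte buffer) are assumptions, not from the original.

```c
/* request.h -- hypothetical header for the example below */
#ifndef REQUEST_H
#define REQUEST_H

/* the ~32-byte message payload; the exact fields are an assumption */
typedef struct request_s {
    unsigned char data[32];
} request_t;

request_t *alloc_request(void);
void free_request(request_t *req);

#endif /* REQUEST_H */
```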
The source file:
#include <stddef.h>   /* for ptrdiff_t */
#include "request.h"  /* the header above: request_t and the prototypes */

/* note: this example is not written with multithreading in mind */

/* allow a million requests (total 32 MB + 128 KB of memory) */
#define MAX_REQUESTS (1*1024*1024)

static request_t g_requests[MAX_REQUESTS];

/* use one bit per request to store whether it's in use */
/* unsigned int is 32 bits. shifting right by 5 divides by 32 */
static unsigned int g_requests_used[MAX_REQUESTS >> 5];

request_t *alloc_request(void) {
    /* note: this is a very naive method. you really don't want to search
     * from the beginning every time, but i'll leave improving that as an
     * exercise for you. */
    unsigned int word_bits;
    unsigned int word, bit;

    /* look through the bit array one word (i.e., 32 bits) at a time */
    for (word = 0; word < (MAX_REQUESTS >> 5); word++) {
        word_bits = g_requests_used[word];

        /* we can tell right away whether the entire chunk of 32 requests is
         * in use, and avoid the inner loop */
        if (word_bits == 0xFFFFFFFFU)
            continue;

        /* now we know there is a gap somewhere in this chunk, so we loop
         * through the 32 bits to find it */
        for (bit = 0; bit < 32; bit++) {
            if (word_bits & (1U << bit))
                continue; /* bit is set, slot is in use */

            /* found a free slot */
            g_requests_used[word] |= 1U << bit;
            return &g_requests[(word << 5) + bit];
        }
    }

    /* we're all out of requests! */
    return NULL;
}

void free_request(request_t *req) {
    /* make sure the request is actually within the g_requests block of
     * memory */
    if (req >= g_requests && req < g_requests + MAX_REQUESTS) {
        /* find the overall index of this request. pointer arithmetic like this
         * is somewhat peculiar to c/c++, you may want to read up on it. */
        ptrdiff_t index = req - g_requests;

        /* reducing a ptrdiff_t to an unsigned int isn't something you should
         * do without thinking about it first. but in our case, we're fine as
         * long as we don't allow more than 2 billion requests, not that our
         * computer could handle that many anyway */
        unsigned int u_index = (unsigned int)index;

        /* do some arithmetic to figure out which bit of which word we need to
         * turn off */
        unsigned int word = u_index >> 5; /* index / 32 */
        unsigned int bit = u_index & 31;  /* index % 32 */
        g_requests_used[word] &= ~(1U << bit);
    }
}
Memory pools

If the static limit bothers you, the next step up is to allocate requests in pools. When traffic dies down, you'd want to be able to free pools, otherwise this would hardly differ from the static limit! Unfortunately, at that point you'll be dealing with fragmentation. You may find yourself with a lot of barely-used pools, especially if requests can sometimes live for a long time: a whole pool can't be freed until every last slot in it is empty. You'd still save on overhead (given the small size of an individual request), but dealing with fragmentation may turn this from a small, elegant solution into more work than it's worth.

You could reduce the number of requests per pool to lessen the impact of fragmentation, but at that point we'd be losing the advantages of the approach.
Which one?
First, the main reasons you should consider alternatives to individual mallocs at all: the small size of the struct (32 bytes), the large quantity, and the frequency at which they're created and destroyed.

- A static array cuts overhead a great deal, but it's hard to justify in this day and age, unless your server is running on an Arduino.
- Memory pools are the obvious direction for this kind of problem, but they can take a fair amount of work to get running smoothly. If that's your cup of tea, I say go for it.
- Slab allocators are like complicated memory pools that aren't restricted to one specific struct size. Since you only have 32-byte requests they'd be overkill for you, though you might find a third-party library that works for you.
Going the simple route and just malloc'ing every request is a slippery slope, though, and may end with you abandoning C entirely. ;)

"The stack" != "malloc". Benchmark first! Only if you find your code spending significant time in malloc and free, or if you run into memory fragmentation, should you reach for a custom allocator. "Thousands" doesn't sound like a big enough number to make the optimization worthwhile.

Nowadays malloc() implementations all have slab-like handling for small chunks. The Linux kernel already has a slab implementation, and when you call malloc() the memory ultimately comes from the kernel anyway. So you most likely don't need to implement your own allocator. In any case, you can only be sure of that after properly profiling your application (perhaps with valgrind) and investigating the results.

Ah, what a great answer! :) If I may suggest a slight augmentation for an even faster pool, consider a union free-list strategy for the memory pool. For example: union Chunk { struct request_s request; union Chunk *next; }. Then replace the requests_used array in the pool with a variable, union Chunk *head, that stores the head of a singly-linked list of chunks. When you create a new pool, push all of its chunks onto this free list. On allocation, pop a chunk and overwrite it with the request data. On deallocation, cast the request back to a Chunk * and use its next pointer to push it back onto the free list. There's no per-node allocation here; the memory still comes from the pool's buffer array. You're just letting each chunk of the pool play two roles: a singly-linked list node while it's free, and an element (a request) while it's in use. This avoids having to scan some other bit array for a free slot, and uses less memory. The downside is that you can't easily free a pool, because it's hard to tell when a pool has become completely empty; but you can reclaim free space very quickly. Remember to watch out for alignment when working with memory allocators.

Thanks for the suggestion. I already…
For reference, here is the pool structure for the memory-pool approach discussed earlier:

#define REQUESTS_PER_POOL 1024

typedef struct request_pool_s request_pool_t;

struct request_pool_s {
    request_t requests[REQUESTS_PER_POOL];
    unsigned int requests_used[REQUESTS_PER_POOL >> 5];
    request_pool_t *prev;
    request_pool_t *next;
};