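Python: arrays of custom-size integers

Simple problem statement: is it possible to have an array of a custom-size data type (3/5/6/7 bytes) in C or Cython?

Background: While trying to code a complex algorithm, I ran into a memory inefficiency problem. The algorithm needs to store a huge amount of data. All of the data is arranged in one contiguous block of memory (like an array). The data is simply a very long list of numbers, which are usually very large. For a given set of numbers, the type of the numbers in this list/array is constant (they behave almost like a regular C array, where all numbers in one array have the same type).

Problem: Sometimes it is inefficient to store each number in a standard data size. The usual data types are char, short, int, long and so on. However, if I use an int array to store values that fit in a 3-byte range, I lose 1 byte of space on every number. This is extremely inefficient, and when you store millions of numbers the effect is memory-breaking. Unfortunately, there is no other way to implement the algorithm, and I believe a rough implementation of a custom data size is the only way to do this.

What I tried: I tried using char arrays for this, but in most cases converting between separate 0-255 byte values and a larger data type is inefficient. Usually there is a mathematical method that packs the chars into a larger number, or takes a larger number and splits out the individual chars. Here is a very inefficient algorithm I tried, written in Cython: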
def to_bytes(long long number, int length):
    cdef:
        list chars = []
        long long m
        long long d
    for _ in range(length):
        m = number % 256
        d = number // 256
        chars.append(m)
        number = d
    cdef bytearray binary = bytearray(chars)
    binary = binary[::-1]
    return binary

def from_bytes(string):
    cdef long long d = int(str(string).encode('hex'), 16)
    return d
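Keep in mind that I am not looking for improvements to this algorithm. I want a basic way to declare an array of a given data type, so that I do not have to do this conversion at all.

In C you can define a custom data type that takes care of the complexities of any arbitrary byte size:
typedef struct byte3 { char x[3]; } byte3;
Then you can do all the nice things, such as passing by value, getting the right sizeof, and creating arrays of this type.

You can use a packed bit-field. With GCC-style attributes, that looks like: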
typedef struct __attribute__((__packed__)) {
int x : 24;
} int24;
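For an int24 x, x.x behaves pretty much like a 24-bit int. You can make an array of these and it will not have any unnecessary padding. Note that this will be slower than using a normal int: the data will not be aligned, and I do not think there is any 24-bit read instruction, so the compiler has to generate extra code for every read and store.

MrAlias and user both make good points, so why not combine them?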
typedef union __attribute__((__packed__)) {
int x : 24;
char s[3];
} u3b;
typedef union __attribute__((__packed__)) {
long long x : 56;
char s[7];
} u7b;
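To make the packed bit-field idea concrete, here is a minimal sketch of an array of such values. It assumes a GCC- or Clang-compatible compiler (for `__attribute__((__packed__))`) and C11 (for `_Static_assert`); the helper name `sum_int24` is made up for this illustration and is not part of any answer above.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang-specific: a packed struct holding one signed 24-bit bit-field. */
typedef struct __attribute__((__packed__)) {
    signed int x : 24;
} int24;

/* Compile-time check that the compiler added no padding (C11). */
_Static_assert(sizeof(int24) == 3, "int24 should occupy exactly 3 bytes");

/* Hypothetical helper: sum n packed 24-bit values.
 * Each element is widened to 64 bits before it is added. */
static int64_t sum_int24(const int24 *a, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i].x;
    return total;
}
```

An array such as `int24 a[1000000]` then takes 3,000,000 bytes rather than the 4,000,000 an `int32_t` array would need, at the cost of the extra load/store code mentioned above.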
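For large amounts of data you can save some memory this way, but the code will almost certainly be slower because of the unaligned accesses involved. For maximum efficiency you should expand the values to a standard integer length and operate on those (reading the array in multiples of 4 or 8).

You will then still have endianness problems, so if you need to be compatible with both big- and little-endian platforms, you have to use the char part of the union on the platforms where the bit-field layout does not match the data (the union only works for one endianness). For the other endianness you need something like: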
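// cast to unsigned char first so that bytes >= 0x80 are not sign-extended
int x = (unsigned char)myu3b.s[0] | ((unsigned char)myu3b.s[1] << 8) | ((unsigned char)myu3b.s[2] << 16);
// or
int x = (unsigned char)myu3b.s[2] | ((unsigned char)myu3b.s[1] << 8) | ((unsigned char)myu3b.s[0] << 16);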
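The byte-shifting expressions above can be wrapped in a pair of helpers that behave the same on any host. This is only a sketch: the names `load24_le` and `store24_le` are hypothetical, and storing the packed value least-significant byte first is an assumption made for the example.

```c
#include <stdint.h>

/* Read a 24-bit value stored least-significant byte first and
 * sign-extend it to 32 bits. Works regardless of host endianness. */
static int32_t load24_le(const unsigned char *p)
{
    uint32_t u = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16);
    /* Flip bit 23, then subtract 2^23: maps 0x800000..0xFFFFFF
     * onto -8388608..-1 without any out-of-range conversion. */
    return (int32_t)(u ^ 0x800000u) - 0x800000;
}

/* Store the low 24 bits of v, least-significant byte first. */
static void store24_le(unsigned char *p, int32_t v)
{
    uint32_t u = (uint32_t)v;
    p[0] = (unsigned char)(u & 0xFFu);
    p[1] = (unsigned char)((u >> 8) & 0xFFu);
    p[2] = (unsigned char)((u >> 16) & 0xFFu);
}
```

Because only shifts and masks on unsigned values are used, the same source works on big- and little-endian machines; the compiler is free to turn it into a single load where the hardware allows.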
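I am all for the bit-field approach, but mind the alignment issues. If you do a lot of random access, you may want to make sure the data is aligned with the cache and CPU architecture.

In addition, I would suggest looking into another approach: you could decompress the data you need on the fly using, for example, zlib. If you expect a lot of repeated values in the stream, this can significantly reduce I/O traffic and memory footprint (assuming the need for random access is not too great). A quick tutorial on zlib is a good starting point.

I think the important question is whether you need access to all of the data at the same time.

If you only need to access one chunk of data at a time

If you only need to access one array at a time, one possibility in Python is to use a NumPy array with data type uint8 and the required width. When you need to operate on the data, you expand the packed data (here, 3-octet numbers into uint32) and then perform the operations on expanded, which is a one-dimensional vector of N uint32 values.

When you are done, you can save the data back: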
# recompress
compressed[:] = expanded.view('uint8').reshape(-1,4)[:,:3]
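For the example above, the time spent in each direction (on my machine) is about 8 ns per element. Using Cython here would probably not bring much of a performance advantage, because almost all of the time is spent copying data between buffers somewhere in the dark depths of NumPy.

This is a high one-time cost, but if you plan to access each element at least once, paying the one-time cost is probably cheaper than paying a similar cost on every operation.

Of course, the same approach can be taken in C: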
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

#define NUMITEMS 10000000

int main(void)
{
    uint32_t *expanded;
    uint8_t *cmpressed, *exp_as_octets;
    struct rusage ru0, ru1;
    uint8_t *ep, *cp, *end;
    double time_delta;

    // create some compressed data (the contents are left uninitialized)
    cmpressed = (uint8_t *)malloc(NUMITEMS * 3);

    getrusage(RUSAGE_SELF, &ru0);

    // allocate the buffer and copy the data
    exp_as_octets = (uint8_t *)malloc(NUMITEMS * 4);
    end = exp_as_octets + NUMITEMS * 4;
    ep = exp_as_octets;
    cp = cmpressed;
    while (ep < end)
    {
        // copy three octets out of four
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = 0;
    }
    expanded = (uint32_t *)exp_as_octets;
    (void)expanded;  // the expanded values would be used here

    getrusage(RUSAGE_SELF, &ru1);
    printf("Uncompress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);

    getrusage(RUSAGE_SELF, &ru0);

    // compress back
    ep = exp_as_octets;
    cp = cmpressed;
    while (ep < end)
    {
        *cp++ = *ep++;
        *cp++ = *ep++;
        *cp++ = *ep++;
        ep++;  // skip the padding octet
    }

    getrusage(RUSAGE_SELF, &ru1);
    printf("Compress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);

    free(exp_as_octets);
    free(cmpressed);
    return 0;
}
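The code was compiled with gcc -Ofast and is probably quite close to the optimal speed. The system time is spent in malloc. To my eye this looks pretty fast, as we are doing memory reads at 2-3 GB/s. (This also means that while making the code multi-threaded would be easy, there may not be much of a speed benefit.)

If you want the best performance, you will need to write the compression/decompression routines separately for each data width. (I do not promise that the C code above is the absolute fastest on any machine; I did not look at the machine code.)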
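As an illustration of what a routine for another width might look like, here is a sketch for 5-octet unsigned values expanded into `uint64_t`. The function names `expand40` and `compress40` are invented for this example, and the least-significant-byte-first layout of the packed buffer is an assumption; it has not been benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand n packed 5-octet values (least-significant byte first)
 * into full 64-bit integers. */
static void expand40(const uint8_t *packed, uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *p = packed + 5 * i;
        out[i] = (uint64_t)p[0]
               | ((uint64_t)p[1] << 8)
               | ((uint64_t)p[2] << 16)
               | ((uint64_t)p[3] << 24)
               | ((uint64_t)p[4] << 32);
    }
}

/* Pack the low 40 bits of each 64-bit integer back into 5 octets. */
static void compress40(const uint64_t *in, uint8_t *packed, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t *p = packed + 5 * i;
        p[0] = (uint8_t)(in[i]);
        p[1] = (uint8_t)(in[i] >> 8);
        p[2] = (uint8_t)(in[i] >> 16);
        p[3] = (uint8_t)(in[i] >> 24);
        p[4] = (uint8_t)(in[i] >> 32);
    }
}
```

Writing the width as a literal in each routine, rather than passing it as a parameter, gives the compiler a fixed-size inner body to unroll and vectorize, which is presumably the point of having one routine per width.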
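If you need random access to separate values

If, on the other hand, you only need to access one value here and another there, Python will not offer any fast method, because the array lookup overhead is huge.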
Output of the 3-octet expand/compress timing program:
Uncompress
User: 0.022650 seconds, 2.27 nanoseconds per element
System: 0.016171 seconds, 1.62 nanoseconds per element
Compress
User: 0.011698 seconds, 1.17 nanoseconds per element
System: 0.000018 seconds, 0.00 nanoseconds per element
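In C, a single packed 3-octet value can be read in place. Note that this reads one octet past the value, assumes a little-endian machine, and relies on unaligned 32-bit loads being allowed:
value = *(uint32_t *)&compressed[3 * n] & 0x00ffffff;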
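If the over-read and the pointer cast are a concern, the same lookup can be written with `memcpy`, which copies exactly three octets and leaves alignment to the compiler. This is a sketch under the assumption of a little-endian host; `get24` and `put24` are hypothetical names, and `compressed` is taken to be a `uint8_t` buffer as in the line above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read the n-th 3-octet unsigned value from a packed buffer.
 * Assumes a little-endian host: the three octets land in the
 * low 24 bits of v and the top octet stays zero. */
static uint32_t get24(const uint8_t *compressed, size_t n)
{
    uint32_t v = 0;
    memcpy(&v, &compressed[3 * n], 3);
    return v;
}

/* Write the low 24 bits of v as the n-th 3-octet value.
 * Same little-endian assumption; neighbouring values are untouched. */
static void put24(uint8_t *compressed, size_t n, uint32_t v)
{
    memcpy(&compressed[3 * n], &v, 3);
}
```

Optimizing compilers typically turn a fixed-size `memcpy` like this into one or two plain load or store instructions, so the safer form need not be slower.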
Timing results for the benchmark program that follows, built with BIT_FIELD set to 0 and then to 1:
Arrays of 800 million entries -- not using bit-field
With 'flex' arrays of 10.4G bytes: took 20.160 secs: user 16.600 system 3.500
With simple arrays of 13.4G bytes: took 32.580 secs: user 14.680 system 4.910
Arrays of 800 million entries -- using bit-field
With 'flex' arrays of 10.4G bytes: took 22.280 secs: user 18.820 system 3.380
With simple arrays of 13.4G bytes: took 20.450 secs: user 14.450 system 4.620
/*==============================================================================
* 2/3/4/5/... byte "integers" and arrays thereof.
*/
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stddef.h>
#include <unistd.h>
#include <memory.h>
#include <stdio.h>
#include <sys/times.h>
#include <assert.h>
/*==============================================================================
* General options
*/
#define BIT_FIELD 0 /* use bit-fields (or not) */
#include <endian.h>
#include <byteswap.h>
/* Newer glibc versions may already provide these macros in <endian.h>;
 * define them here only when they are missing. */
#ifndef htole16
# if __BYTE_ORDER == __LITTLE_ENDIAN
#  define htole16(x) (x)
#  define le16toh(x) (x)
#  define htole32(x) (x)
#  define le32toh(x) (x)
#  define htole64(x) (x)
#  define le64toh(x) (x)
# else
#  define htole16(x) __bswap_16 (x)
#  define le16toh(x) __bswap_16 (x)
#  define htole32(x) __bswap_32 (x)
#  define le32toh(x) __bswap_32 (x)
#  define htole64(x) __bswap_64 (x)
#  define le64toh(x) __bswap_64 (x)
# endif
#endif
typedef int64_t imax_t ;
/*------------------------------------------------------------------------------
* 2 byte integer
*/
#if BIT_FIELD
typedef struct __attribute__((packed)) { int16_t i : 2 * 8 ; } iflex_2b_t ;
#else
typedef struct { int8_t b[2] ; } iflex_2b_t ;
#endif
inline static int16_t
iflex_get_2b(iflex_2b_t item)
{
#if BIT_FIELD
return item.i ;
#else
union
{
int16_t i ;
iflex_2b_t f ;
} x ;
x.f = item ;
return le16toh(x.i) ;
#endif
} ;
inline static iflex_2b_t
iflex_put_2b(int16_t val)
{
#if BIT_FIELD
iflex_2b_t x ;
x.i = val ;
return x ;
#else
union
{
int16_t i ;
iflex_2b_t f ;
} x ;
x.i = htole16(val) ;
return x.f ;
#endif
} ;
/*------------------------------------------------------------------------------
* 3 byte integer
*/
#if BIT_FIELD
typedef struct __attribute__((packed)) { int32_t i : 3 * 8 ; } iflex_3b_t ;
#else
typedef struct { int8_t b[3] ; } iflex_3b_t ;
#endif
inline static int32_t
iflex_get_3b(iflex_3b_t item)
{
#if BIT_FIELD
return item.i ;
#else
union
{
int32_t i ;
int16_t s[2] ;
iflex_2b_t t[2] ;
} x ;
x.t[0] = *((iflex_2b_t*)&item) ;
x.s[1] = htole16(item.b[2]) ;
return le32toh(x.i) ;
#endif
} ;
inline static iflex_3b_t
iflex_put_3b(int32_t val)
{
#if BIT_FIELD
iflex_3b_t x ;
x.i = val ;
return x ;
#else
union
{
int32_t i ;
iflex_3b_t f ;
} x ;
x.i = htole32(val) ;
return x.f ;
#endif
} ;
/*------------------------------------------------------------------------------
* 4 byte integer
*/
#if BIT_FIELD
typedef struct __attribute__((packed)) { int32_t i : 4 * 8 ; } iflex_4b_t ;
#else
typedef struct { int8_t b[4] ; } iflex_4b_t ;
#endif
inline static int32_t
iflex_get_4b(iflex_4b_t item)
{
#if BIT_FIELD
return item.i ;
#else
union
{
int32_t i ;
iflex_4b_t f ;
} x ;
x.f = item ;
return le32toh(x.i) ;
#endif
} ;
inline static iflex_4b_t
iflex_put_4b(int32_t val)
{
#if BIT_FIELD
iflex_4b_t x ;
x.i = val ;
return x ;
#else
union
{
int32_t i ;
iflex_4b_t f ;
} x ;
x.i = htole32((int32_t)val) ;
return x.f ;
#endif
} ;
/*------------------------------------------------------------------------------
* 5 byte integer
*/
#if BIT_FIELD
typedef struct __attribute__((packed)) { int64_t i : 5 * 8 ; } iflex_5b_t ;
#else
typedef struct { int8_t b[5] ; } iflex_5b_t ;
#endif
inline static int64_t
iflex_get_5b(iflex_5b_t item)
{
#if BIT_FIELD
return item.i ;
#else
union
{
int64_t i ;
int32_t s[2] ;
iflex_4b_t t[2] ;
} x ;
x.t[0] = *((iflex_4b_t*)&item) ;
x.s[1] = htole32(item.b[4]) ;
return le64toh(x.i) ;
#endif
} ;
inline static iflex_5b_t
iflex_put_5b(int64_t val)
{
#if BIT_FIELD
iflex_5b_t x ;
x.i = val ;
return x ;
#else
union
{
int64_t i ;
iflex_5b_t f ;
} x ;
x.i = htole64(val) ;
return x.f ;
#endif
} ;
/*------------------------------------------------------------------------------
*
*/
#define alignof(t) __alignof__(t)
/*==============================================================================
* To begin at the beginning...
*/
int
main(int argc, char* argv[])
{
int count = 800 ;
assert(sizeof(iflex_2b_t) == 2) ;
assert(alignof(iflex_2b_t) == 1) ;
assert(sizeof(iflex_3b_t) == 3) ;
assert(alignof(iflex_3b_t) == 1) ;
assert(sizeof(iflex_4b_t) == 4) ;
assert(alignof(iflex_4b_t) == 1) ;
assert(sizeof(iflex_5b_t) == 5) ;
assert(alignof(iflex_5b_t) == 1) ;
clock_t at_start_clock, at_end_clock ;
struct tms at_start_tms, at_end_tms ;
clock_t ticks ;
printf("Arrays of %d million entries -- %susing bit-field\n", count,
BIT_FIELD ? "" : "not ") ;
count *= 1000000 ;
iflex_2b_t* arr2 = malloc(count * sizeof(iflex_2b_t)) ;
iflex_3b_t* arr3 = malloc(count * sizeof(iflex_3b_t)) ;
iflex_4b_t* arr4 = malloc(count * sizeof(iflex_4b_t)) ;
iflex_5b_t* arr5 = malloc(count * sizeof(iflex_5b_t)) ;
size_t bytes = ((size_t)count * (2 + 3 + 4 + 5)) ;
srand(314159) ;
at_start_clock = times(&at_start_tms) ;
for (int i = 0 ; i < count ; i++)
{
imax_t v5, v4, v3, v2, r ;
v2 = (int16_t)(rand() % 0x10000) ;
arr2[i] = iflex_put_2b(v2) ;
v3 = (v2 * 0x100) | ((i & 0xFF) ^ 0x33) ;
arr3[i] = iflex_put_3b(v3) ;
v4 = (v3 * 0x100) | ((i & 0xFF) ^ 0x44) ;
arr4[i] = iflex_put_4b(v4) ;
v5 = (v4 * 0x100) | ((i & 0xFF) ^ 0x55) ;
arr5[i] = iflex_put_5b(v5) ;
r = iflex_get_2b(arr2[i]) ;
assert(r == v2) ;
r = iflex_get_3b(arr3[i]) ;
assert(r == v3) ;
r = iflex_get_4b(arr4[i]) ;
assert(r == v4) ;
r = iflex_get_5b(arr5[i]) ;
assert(r == v5) ;
} ;
for (int i = count - 1 ; i >= 0 ; i--)
{
imax_t v5, v4, v3, v2, r, b ;
v5 = iflex_get_5b(arr5[i]) ;
b = (i & 0xFF) ^ 0x55 ;
assert((v5 & 0xFF) == b) ;
r = (v5 ^ b) / 0x100 ;
v4 = iflex_get_4b(arr4[i]) ;
assert(v4 == r) ;
b = (i & 0xFF) ^ 0x44 ;
assert((v4 & 0xFF) == b) ;
r = (v4 ^ b) / 0x100 ;
v3 = iflex_get_3b(arr3[i]) ;
assert(v3 == r) ;
b = (i & 0xFF) ^ 0x33 ;
assert((v3 & 0xFF) == b) ;
r = (v3 ^ b) / 0x100 ;
v2 = iflex_get_2b(arr2[i]) ;
assert(v2 == r) ;
} ;
at_end_clock = times(&at_end_tms) ;
ticks = sysconf(_SC_CLK_TCK) ;
printf("With 'flex' arrays of %4.1fG bytes: "
"took %5.3f secs: user %5.3f system %5.3f\n",
(double)bytes / (double)(1024 *1024 *1024),
(double)(at_end_clock - at_start_clock) / (double)ticks,
(double)(at_end_tms.tms_utime - at_start_tms.tms_utime) / (double)ticks,
(double)(at_end_tms.tms_stime - at_start_tms.tms_stime) / (double)ticks) ;
free(arr2) ;
free(arr3) ;
free(arr4) ;
free(arr5) ;
int16_t* brr2 = malloc(count * sizeof(int16_t)) ;
int32_t* brr3 = malloc(count * sizeof(int32_t)) ;
int32_t* brr4 = malloc(count * sizeof(int32_t)) ;
int64_t* brr5 = malloc(count * sizeof(int64_t)) ;
bytes = ((size_t)count * (2 + 4 + 4 + 8)) ;
srand(314159) ;
at_start_clock = times(&at_start_tms) ;
for (int i = 0 ; i < count ; i++)
{
imax_t v5, v4, v3, v2, r ;
v2 = (int16_t)(rand() % 0x10000) ;
brr2[i] = v2 ;
v3 = (v2 * 0x100) | ((i & 0xFF) ^ 0x33) ;
brr3[i] = v3 ;
v4 = (v3 * 0x100) | ((i & 0xFF) ^ 0x44) ;
brr4[i] = v4 ;
v5 = (v4 * 0x100) | ((i & 0xFF) ^ 0x55) ;
brr5[i] = v5 ;
r = brr2[i] ;
assert(r == v2) ;
r = brr3[i] ;
assert(r == v3) ;
r = brr4[i] ;
assert(r == v4) ;
r = brr5[i] ;
assert(r == v5) ;
} ;
for (int i = count - 1 ; i >= 0 ; i--)
{
imax_t v5, v4, v3, v2, r, b ;
v5 = brr5[i] ;
b = (i & 0xFF) ^ 0x55 ;
assert((v5 & 0xFF) == b) ;
r = (v5 ^ b) / 0x100 ;
v4 = brr4[i] ;
assert(v4 == r) ;
b = (i & 0xFF) ^ 0x44 ;
assert((v4 & 0xFF) == b) ;
r = (v4 ^ b) / 0x100 ;
v3 = brr3[i] ;
assert(v3 == r) ;
b = (i & 0xFF) ^ 0x33 ;
assert((v3 & 0xFF) == b) ;
r = (v3 ^ b) / 0x100 ;
v2 = brr2[i] ;
assert(v2 == r) ;
} ;
at_end_clock = times(&at_end_tms) ;
printf("With simple arrays of %4.1fG bytes: "
"took %5.3f secs: user %5.3f system %5.3f\n",
(double)bytes / (double)(1024 *1024 *1024),
(double)(at_end_clock - at_start_clock) / (double)ticks,
(double)(at_end_tms.tms_utime - at_start_tms.tms_utime) / (double)ticks,
(double)(at_end_tms.tms_stime - at_start_tms.tms_stime) / (double)ticks) ;
free(brr2) ;
free(brr3) ;
free(brr4) ;
free(brr5) ;
return 0 ;
} ;