How do I detect UTF-8 in plain C?
I am looking for a code snippet in plain old C that detects whether a given string is UTF-8 encoded. I know the solution with regular expressions, but for various reasons it would be better to avoid anything other than plain C in this particular case. The solution with regular expressions looks like this (warning: various checks omitted):
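You have to parse the string as UTF-8, which is very simple. If the parsing fails, it is not UTF-8. There are several simple UTF-8 libraries around that can do this. It could perhaps be simplified if you know the string is either plain old ASCII or contains characters outside ASCII that are UTF-8 encoded. In that case you often do not need to care about the difference: UTF-8 was designed so that existing programs that handle ASCII can, in most cases, handle UTF-8 transparently. Keep in mind that ASCII is encoded in UTF-8 as itself, so ASCII is valid UTF-8. A C string can be anything. Is the problem you need to solve that you do not know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the dozen other character encodings that make a program's life hard?
You can use the UTF-8 detector found in the universal charset detector, which is pretty much a stand-alone C++ library. It should be very easy to find the class that recognizes UTF-8 and take only that. What this class basically does is detect character sequences that are unique to UTF-8.
- get the latest firefox trunk
- go to \mozilla\extensions\universalchardet\
- find the UTF-8 detector class (I do not quite remember its exact name)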
U+0000-U+007F 0xxxxxxx
U+0080-U+07FF 110yyyxx 10xxxxxx
U+0800-U+FFFF 1110yyyy 10yyyyxx 10xxxxxx
U+10000-U+10FFFF 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
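The table above can be read as a rule for classifying the first byte of a sequence. The following is a minimal sketch of that rule only; `utf8_sequence_length` is a hypothetical helper name, and the function looks at bit patterns alone, so it does not reject overlong forms, surrogates or lead bytes above 0xF4.

```c
/* Hypothetical helper (not from the original answers): returns the total
 * sequence length (1 to 4) implied by a lead byte according to the table
 * above, or 0 if the byte cannot start a sequence (a 10xxxxxx continuation
 * byte, or a 11111xxx byte). */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```

For example, `utf8_sequence_length(0xC3)` yields 2, while a continuation byte such as 0xA9 yields 0.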
Here is a (hopefully bug-free) implementation in plain C:
_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
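This decoder by Bjoern Hoehrmann is the simplest one I have found. It works by feeding it one byte at a time while keeping a state. The state is very useful for parsing UTF-8 that arrives in chunks over the network.
validate_utf8 returns UTF8_ACCEPT if the text is valid UTF-8 and UTF8_REJECT if it is invalid. Any other value means the data so far ends in the middle of a sequence and more bytes are needed.
Usage example with data supplied in chunks (for example from the network); the decoder, the validator and the example are in the code listing further below: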
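It is impossible to detect that a given array of bytes is a UTF-8 string. You can reliably determine that it cannot be valid UTF-8 (which does not mean it is not invalid UTF-8), and you can determine that it might be a valid UTF-8 sequence, but this is subject to false positives.
For a simple example, use a random number generator to generate an array of 3 random bytes and use it to test your code. These are random bytes and therefore not UTF-8, so every string your code thinks is "possibly UTF-8" is a false positive. My guess is that (under these conditions) your code will be wrong about 12% of the time.
Once you recognise that it is impossible, you can start thinking about returning a confidence level (in addition to your prediction). For example, your function might return something like "I am 88% confident that this is UTF-8".
Now do this for all other types of data. For example, you might have a function that checks whether the data is UTF-16 and returns "I am 95% confident that this is UTF-16", and then decide (because 95% is higher than 88%) that the data is more likely to be UTF-16 than UTF-8.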
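The comparison step described above can be sketched as picking the candidate with the highest confidence. This is only an illustration of the selection logic: `struct encoding_guess` and `best_guess` are hypothetical names, and how each confidence value would actually be computed is left open, as it is in the answer.

```c
#include <stddef.h>

/* Hypothetical record: one encoding guess with a confidence in [0, 1]. */
struct encoding_guess {
    const char *name;
    double confidence;
};

/* Returns the guess with the highest confidence, or NULL if count is 0.
 * On a tie the earlier entry wins. */
static const struct encoding_guess *
best_guess(const struct encoding_guess *guesses, size_t count)
{
    const struct encoding_guess *best = NULL;
    size_t i;

    for (i = 0; i < count; i++)
        if (best == NULL || guesses[i].confidence > best->confidence)
            best = &guesses[i];
    return best;
}
```

With the figures from the text, an array holding {"UTF-8", 0.88} and {"UTF-16", 0.95} makes `best_guess` return the UTF-16 entry.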
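The next step is to add tricks that increase your confidence levels. For example, if the string seems to consist mostly of groups of valid syllables separated by white space, you can be a lot more confident that it actually is UTF-8. In the same way, if the data might be HTML, you could check for things that could be valid HTML markup and use that to increase your confidence.
Of course the same applies to other types of data. For example, if the data has a valid PE32 or ELF header, or a correct BMP or JPG or MP3 header, then you can be a lot more confident that it is not UTF-8 at all.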
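A header check of the kind mentioned above might look like the sketch below. `binary_signature` is a hypothetical helper, and it only knows a few well-known signatures: 0x7F "ELF" for ELF, "MZ" for the DOS stub that PE files begin with, "BM" for BMP, and FF D8 FF for JPEG. Note that the ELF, MZ and BM signatures are themselves plain ASCII, so a UTF-8 validator alone would not rule such files out; since ordinary text can also start with "MZ" or "BM", a match is best treated as a hint that lowers confidence rather than as proof.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the name of a well-known binary format whose
 * signature starts the buffer, or NULL if none of the listed ones match. */
static const char *binary_signature(const unsigned char *data, size_t len)
{
    /* "\x7F" "ELF" is split so the hex escape does not swallow the 'E'. */
    if (len >= 4 && memcmp(data, "\x7F" "ELF", 4) == 0)
        return "ELF";
    if (len >= 3 && memcmp(data, "\xFF\xD8\xFF", 3) == 0)
        return "JPEG";
    if (len >= 2 && memcmp(data, "MZ", 2) == 0)
        return "PE (MZ stub)";
    if (len >= 2 && memcmp(data, "BM", 2) == 0)
        return "BMP";
    return NULL;
}
```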
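A far better approach is to fix the actual cause of the problem. For example, it may be possible to add some sort of "document type" identifier to the start of all files you care about, or to say "this software assumes UTF-8 and does not support anything else", so that you do not need to make dodgy guesses in the first place.
By my calculation, 3 random bytes have a 15.8% chance of being valid UTF-8:
128^3 possible ASCII-only sequences = 2097152
2^16-2^11 possible 3-byte UTF-8 characters (assuming surrogate halves and non-characters are allowed) = 63488
1920 2-byte UTF-8 characters either before or after an ASCII character = 1920*128*2 = 491520
Divided by the number of 3-byte sequences = (2097152+63488+491520)/16777216.0 = 0.1580810546875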
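The estimate above can be cross-checked by brute force over all 2^24 three-byte buffers. The sketch below is an assumption-laden illustration, not code from the answer: `lenient_utf8` and `count_valid_3byte` are hypothetical names, and the validator deliberately accepts surrogate halves and non-characters (and the NUL byte as ASCII) so that it matches the assumptions stated in the calculation.

```c
#include <stddef.h>

/* Structural UTF-8 check on a length-delimited buffer. It rejects overlong
 * forms and lead bytes above 0xF4 but, to match the estimate above, accepts
 * surrogate halves (ED A0..BF xx) and non-characters. */
static int lenient_utf8(const unsigned char *p, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = p[i];
        size_t need, k;

        if (b <= 0x7F) { i++; continue; }              /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) need = 1;     /* 2-byte lead */
        else if (b >= 0xE0 && b <= 0xEF) need = 2;     /* 3-byte lead */
        else if (b >= 0xF0 && b <= 0xF4) need = 3;     /* 4-byte lead */
        else return 0;                                 /* stray or invalid */

        if (n - i <= need) return 0;                   /* truncated */
        for (k = 1; k <= need; k++)
            if ((p[i + k] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
        if (b == 0xE0 && p[i + 1] < 0xA0) return 0;    /* overlong 3-byte */
        if (b == 0xF0 && p[i + 1] < 0x90) return 0;    /* overlong 4-byte */
        if (b == 0xF4 && p[i + 1] > 0x8F) return 0;    /* above U+10FFFF */
        i += need + 1;
    }
    return 1;
}

/* Counts how many of the 2^24 possible 3-byte buffers pass the check. */
static unsigned long count_valid_3byte(void)
{
    unsigned long valid = 0, v;

    for (v = 0; v < 0x1000000UL; v++) {
        unsigned char buf[3];
        buf[0] = (unsigned char)(v >> 16);
        buf[1] = (unsigned char)(v >> 8);
        buf[2] = (unsigned char)v;
        valid += (unsigned long)lenient_utf8(buf, 3);
    }
    return valid;
}
```

Under these assumptions the count is 2652160, which divided by 16777216 gives the 0.158 figure. A validator that also rejects surrogate halves would count 32*64 = 2048 fewer buffers.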
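This vastly overestimates the number of incorrect matches, because the file is only 3 bytes long. The intersection shrinks rapidly as the number of bytes increases. Also, actual text in a non-UTF-8 encoding is not random: it contains a large number of lone bytes with the high bit set, which are not valid UTF-8.
A more useful metric for guessing the odds of failure is how likely a sequence of bytes with the high bit set is to be valid UTF-8. I get these values: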
1 byte = 0% # the really important number that is often ignored
2 byte = 11.7%
3 byte = 3.03% (assumes surrogate halves are valid)
4 byte = 1.76% (includes two 2-byte characters)
It is also instructive to try to find an actual readable string (in any language and any encoding) that is also a valid UTF-8 string.
// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
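#include <stdint.h>  /* uint8_t, uint32_t */
#include <stddef.h>  /* size_t */
#include <stdio.h>   /* fprintf, used by the usage example */

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1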
static const uint8_t utf8d[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};
/* static inline, so that a plain C99 build does not end up without an
   external definition of decode(). */
static inline uint32_t
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
    uint32_t type = utf8d[byte];

    *codep = (*state != UTF8_ACCEPT) ?
        (byte & 0x3fu) | (*codep << 6) :
        (0xff >> type) & (byte);

    *state = utf8d[256 + *state*16 + type];
    return *state;
}
uint32_t validate_utf8(uint32_t *state, char *str, size_t len) {
    size_t i;
    uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}
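// get_new_data() is a placeholder from the original example, not a real API.
// It is assumed to fill buf with up to size bytes and to return the number
// of bytes read, or 0 once the input is exhausted.
size_t get_new_data(char *buf, size_t size);

// validate_stream() is a hypothetical wrapper so the fragment can return.
int validate_stream(void) {
    char buf[128];
    size_t bytes_read;
    uint32_t state = UTF8_ACCEPT;

    // Validate the UTF8 data in chunks.
    while ((bytes_read = get_new_data(buf, sizeof(buf)))) {
        if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT) {
            fprintf(stderr, "Invalid UTF8 data!\n");
            return -1;
        }
    }

    // If everything went well we have proper UTF8. Otherwise the
    // data ended in the middle of a UTF8 codepoint.
    if (state != UTF8_ACCEPT) {
        fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
        return -1;
    }
    return 0;
}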
#include<stdio.h>
#include<string.h>
/* UTF-8 : BYTE_BITS*/
/* B0_BYTE : 0XXXXXXX */
/* B1_BYTE : 10XXXXXX */
/* B2_BYTE : 110XXXXX */
/* B3_BYTE : 1110XXXX */
/* B4_BYTE : 11110XXX */
/* B5_BYTE : 111110XX */
/* B6_BYTE : 1111110X */
#define B0_BYTE 0x00
#define B1_BYTE 0x80
#define B2_BYTE 0xC0
#define B3_BYTE 0xE0
#define B4_BYTE 0xF0
#define B5_BYTE 0xF8
#define B6_BYTE 0xFC
#define B7_BYTE 0xFE
/* Please tune this as per number of lines input */
#define MAX_UTF8_STR 10
/* 600 is used because 6byteX100chars */
#define MAX_UTF8_CHR 600
void func_find_utf8 (char *ptr_to_str);
void print_non_ascii (int bytes, char *pbyte);
char strbuf[MAX_UTF8_STR][MAX_UTF8_CHR];
int
main (int ac, char *av[])
{
  int i = 0;
  char no_newln_str[MAX_UTF8_CHR];

  (void) ac;
  (void) av;
  printf ("\n\nYou can enter utf-8 string or Q/q to QUIT\n\n");
  while (i < MAX_UTF8_STR)
    {
      /* Stop on end of input or on a read error */
      if (!fgets (strbuf[i], MAX_UTF8_CHR, stdin))
        break;
      if (!strlen (strbuf[i]))
        break;
      if ((strbuf[i][0] == 'Q') || (strbuf[i][0] == 'q'))
        break;
      strcpy (no_newln_str, strbuf[i]);
      /* Strip the trailing newline, if there is one */
      no_newln_str[strcspn (no_newln_str, "\n")] = 0;
      func_find_utf8 (no_newln_str);
      ++i;
    }
  return 0;
}
void
func_find_utf8 (char *ptr_to_str)
{
  int found_non_ascii;
  int bytes;
  char *pbyte;

  pbyte = ptr_to_str;
  found_non_ascii = 0;
  while (*pbyte)
    {
      if ((*pbyte & B1_BYTE) == B0_BYTE)
        {
          pbyte++;
          continue;
        }
      found_non_ascii = 1;
      /* Derive the sequence length from the lead byte. The 5- and 6-byte
         forms come from the original UTF-8 definition; RFC 3629 allows at
         most 4 bytes. */
      if ((*pbyte & B7_BYTE) == B6_BYTE)
        bytes = 6;
      else if ((*pbyte & B6_BYTE) == B5_BYTE)
        bytes = 5;
      else if ((*pbyte & B5_BYTE) == B4_BYTE)
        bytes = 4;
      else if ((*pbyte & B4_BYTE) == B3_BYTE)
        bytes = 3;
      else if ((*pbyte & B3_BYTE) == B2_BYTE)
        bytes = 2;
      else
        bytes = 1;  /* stray continuation byte, 0xFE or 0xFF: step over it
                       (the original looped forever here) */
      /* Never step past the terminating NUL of a truncated sequence */
      if ((int) strlen (pbyte) < bytes)
        bytes = (int) strlen (pbyte);
      print_non_ascii (bytes, pbyte);
      pbyte += bytes;
    }
  if (found_non_ascii)
    printf (" These are Non Ascii chars\n");
}
void
print_non_ascii (int bytes, char *pbyte)
{
  char store[6];
  int i;

  memset (store, 0, 6);
  memcpy (store, pbyte, bytes);
  i = 0;
  while (i < bytes)
    printf ("%c", store[i++]);
  printf ("%c", ' ');
  fflush (stdout);
}
#include <string.h>  /* strlen */

/*
** Checks if the given string has all bytes like: 10xxxxxx
** where x is either 0 or 1
*/
static int chars_are_folow_uni(const unsigned char *chars)
{
    while (*chars)
    {
        if ((*chars >> 6) != 0x2)
            return (0);
        chars++;
    }
    return (1);
}

/*
** Checks if the given NUL-terminated string is exactly one UTF-8
** encoded character (bit patterns only, no overlong or surrogate checks)
*/
int char_is_utf8(const unsigned char *key)
{
    size_t required_len;

    if (key[0] >> 7 == 0)
        required_len = 1;
    else if (key[0] >> 5 == 0x6)
        required_len = 2;
    else if (key[0] >> 4 == 0xE)
        required_len = 3;
    else if (key[0] >> 3 == 0x1E)   /* 11110xxx: shift by 3, not 5 */
        required_len = 4;
    else
        return (0);
    return (strlen((const char *)key) == required_len && chars_are_folow_uni(key + 1));
}
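/* ft_to_utf8() is not defined in this snippet. It is assumed to write the
   NUL-terminated UTF-8 encoding of the wide character into buf. */
unsigned char buf[5];
ft_to_utf8(L'歓', buf);
printf("%d\n", char_is_utf8(buf)); // => 1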
#include <stdint.h>
#include <stddef.h> /* size_t */
/**
 * Maps the top 5 bits in a byte (0b11111xxx) to a UTF-8 codepoint length.
 *
 * Codepoint length 0 == error.
 *
 * The first valid length can be any value between 1 to 4 (5 == error).
 *
 * An intermediate (second, third or fourth) valid length must be 5.
 *
 * The map was populated using the following Ruby script:
 *
 *      map = []; 32.times { map << 0 }; (0..0b1111).each {|i| map[i] = 1} ;
 *      (0b10000..0b10111).each {|i| map[i] = 5} ;
 *      (0b11000..0b11011).each {|i| map[i] = 2} ;
 *      (0b11100..0b11101).each {|i| map[i] = 3} ;
 *      map[0b11110] = 4; map;
 */
static uint8_t fio_str_utf8_map[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5,
5, 5, 2, 2, 2, 2, 3, 3, 4, 0};
/**
* Advances the `ptr` by one utf-8 character, placing the value of the UTF-8
* character into the i32 variable (which must be a signed integer with 32bits
* or more). On error, `i32` will be equal to `-1` and `ptr` will not step
* forwards.
*
* The `end` value is only used for overflow protection.
*/
#define FIO_STR_UTF8_CODE_POINT(ptr, end, i32) \
  switch (fio_str_utf8_map[((uint8_t *)(ptr))[0] >> 3]) { \
  case 1: \
    (i32) = ((uint8_t *)(ptr))[0]; \
    ++(ptr); \
    break; \
  case 2: \
    if (((ptr) + 2 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = \
        ((((uint8_t *)(ptr))[0] & 31) << 6) | (((uint8_t *)(ptr))[1] & 63); \
    (ptr) += 2; \
    break; \
  case 3: \
    if (((ptr) + 3 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = ((((uint8_t *)(ptr))[0] & 15) << 12) | \
            ((((uint8_t *)(ptr))[1] & 63) << 6) | \
            (((uint8_t *)(ptr))[2] & 63); \
    (ptr) += 3; \
    break; \
  case 4: \
    if (((ptr) + 4 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[3] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = ((((uint8_t *)(ptr))[0] & 7) << 18) | \
            ((((uint8_t *)(ptr))[1] & 63) << 12) | \
            ((((uint8_t *)(ptr))[2] & 63) << 6) | \
            (((uint8_t *)(ptr))[3] & 63); \
    (ptr) += 4; \
    break; \
  default: \
    (i32) = -1; \
    break; \
  }
/** Returns 1 if the String is UTF-8 valid and 0 if not. */
inline static size_t fio_str_utf8_valid2(char const *str, size_t length) {
  if (!str)
    return 0;
  if (!length)
    return 1;
  const char *const end = str + length;
  int32_t c = 0;
  do {
    FIO_STR_UTF8_CODE_POINT(str, end, c);
  } while (c > 0 && str < end);
  return str == end && c >= 0;
}
#include <stdio.h>  /* printf */
#include <stdlib.h> /* exit */

#define __FALSE (0)
#define __TRUE  (!__FALSE)

#define MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR        (0)
#define MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR        (1)
#define MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR     (2)
#define MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS    (3)
#define MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS    (4)

typedef int __BOOL;

/* Counts the leading 1 bits of the current byte and advances the pointer. */
int CountMS1BitSequenceAndForward(const char **p) {
    int     Mask;
    int     Result = 0;
    char    c = **p;
    ++(*p);
    for (Mask=0x80;c&(Mask&0xFF);Mask>>=1,++Result);
    return Result;
}

int MS1BitSequenceCount2SuccessorByteCount(int MS1BitSeqCount) {
    switch (MS1BitSeqCount) {
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR: return 1;
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS: return 2;
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS: return 3;
    }
    return 0;
}

__BOOL ExpectUTF8SuccessorCharsOrReturnFalse(const char **Str, int NumberOfCharsToExpect) {
    while (NumberOfCharsToExpect--) {
        if (CountMS1BitSequenceAndForward(Str) != MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR) {
            return __FALSE;
        }
    }
    return __TRUE;
}

__BOOL IsMS1BitSequenceCountAValidUTF8Starter(int Number) {
    switch (Number) {
    case MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR:
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR:
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS:
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS:
        return __TRUE;
    }
    return __FALSE;
}

#define NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8       (-1)
#define NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx        (-1)

/* Returns the number of multi-byte sequences, or a negative value if the
   string is not structurally valid UTF-8. */
int CountValidUTF8CharactersOrNegativeOnBadUTF8(const char *Str) {
    int NumberOfValidUTF8Sequences = 0;
    if (!Str || !Str[0]) { return 0; }
    while (*Str) {
        int MS1BitSeqCount = CountMS1BitSequenceAndForward(&Str);
        if (!IsMS1BitSequenceCountAValidUTF8Starter(MS1BitSeqCount)) {
            return NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8;
        }
        if (!ExpectUTF8SuccessorCharsOrReturnFalse(&Str, MS1BitSequenceCount2SuccessorByteCount(MS1BitSeqCount))) {
            return NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx;
        }
        if (MS1BitSeqCount) { ++NumberOfValidUTF8Sequences; }
    }
    return NumberOfValidUTF8Sequences;
}

static void TestUTF8CheckOrDie(const char *Str, int ExpectedResult) {
    int Result = CountValidUTF8CharactersOrNegativeOnBadUTF8(Str);
    if (Result != ExpectedResult) {
        printf("TEST FAILED: %s:%i: check on '%s' returned %i, but expected was %i\n", __FILE__, __LINE__, Str ? Str : "(null)", Result, ExpectedResult);
        exit(1);
    }
}

/* The non-ASCII literals below assume this source file is saved as UTF-8. */
void SimpleUTF8TestCases(void) {
    TestUTF8CheckOrDie("abcd89234", 0);  // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie("", 0);           // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie(NULL, 0);
    TestUTF8CheckOrDie("asdföadkg", 1);  // contains one valid UTF8 character sequence
    TestUTF8CheckOrDie("asdföadäkg", 2); // contains two valid UTF8 character sequences
    TestUTF8CheckOrDie("asdf\xF8" "adäkg", -1); // contains at least one invalid UTF8 sequence
}