How to detect UTF-8 in plain C?

c utf-8


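I am looking for a code snippet in plain old C that detects whether a given string is UTF-8 encoded. I know the solution that uses a regular expression, but for various reasons it would be better to avoid anything other than plain C in this particular case.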

The regular-expression solution looks like this (warning: various checks omitted):

You have to parse the string as UTF-8, which is fairly simple. If parsing fails, the string is not UTF-8. Several simple UTF-8 libraries can do this.

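This can perhaps be simplified if you know the string is either plain old ASCII or contains characters outside ASCII that are encoded as UTF-8. In that case you often do not need to care about the difference: UTF-8 was designed so that existing programs that handle ASCII can, in most cases, handle UTF-8 transparently.
Keep in mind that ASCII encodes as itself in UTF-8, so ASCII is valid UTF-8.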
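A C string can be anything. Is the problem you need to solve that you do not know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the dozen other character encodings that make a programmer's life hard?
You can use the UTF-8 detector that ships with Firefox. It is part of the universal charset detector, which is pretty much a standalone C++ library. It should be very easy to find the class that recognizes UTF-8 and take only that.
What this class basically does is detect byte sequences that are characteristic of UTF-8.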

Get the latest Firefox trunk

Go to \mozilla\extensions\universalchardet\

Find the UTF-8 detector class (I do not remember its exact name)

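You cannot detect whether a given string (or byte sequence) is UTF-8 encoded text, because, for example, every sequence of UTF-8 octets is also a valid (if nonsensical) sequence of Latin-1 (or some other encoding) octets. However, not every valid Latin-1 octet sequence is a valid UTF-8 sequence. So you can rule out strings that do not conform to the UTF-8 encoding scheme: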

U+0000-U+007F      0xxxxxxx
U+0080-U+07FF      110yyyxx 10xxxxxx
U+0800-U+FFFF      1110yyyy 10yyyyxx 10xxxxxx
U+10000-U+10FFFF   11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
Here is a (hopefully bug-free) implementation in pure C:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

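This decoder by Bjoern Hoehrmann is the simplest one I have found. It also works by feeding it one byte at a time while keeping a state. The state is very useful for parsing UTF-8 that arrives in chunks over the network.

A simple validator does not need the code point, so it can be reduced to validate_utf8. It returns UTF8_ACCEPT if the text is valid UTF-8 and UTF8_REJECT if it is invalid; any other value means the data ended in the middle of a sequence and more bytes are needed.

A usage example that feeds the data in chunks (for example, from the network) follows the decoder.

It is impossible to detect that a given array of bytes is a UTF-8 string. You can reliably determine that it cannot be valid UTF-8 (which does not mean it is not invalid UTF-8), and you can determine that it might be a valid UTF-8 sequence, but that is subject to false positives.

As a simple example, use a random number generator to produce an array of 3 random bytes and test your code with it. These are random bytes and therefore not UTF-8, so every string your code considers "possibly UTF-8" is a false positive. My guess is that (under these conditions) your code will be wrong about 12% of the time.

Once you accept that it is impossible, you can start thinking about returning a confidence level (in addition to your prediction). For example, your function might return something like "I am 88% sure this is UTF-8".

Now do this for all other types of data. For example, you might have a function that checks whether the data is UTF-16 and returns "I am 95% sure this is UTF-16", and then decide (because 95% is higher than 88%) that the data is more likely UTF-16 than UTF-8.

The next step is to add tricks that raise your confidence. For example, if the string seems to consist mostly of groups of valid syllables separated by whitespace, you can be much more confident that it really is UTF-8. Likewise, if the data might be HTML, you can check for things that could be valid HTML markup and use that to increase your confidence.

Of course the same applies to other types of data. For example, if the data has a valid PE32 or ELF header, or a correct BMP, JPG or MP3 header, you can be much more confident that it is not UTF-8 at all.

A far better approach is to fix the actual cause of the problem. For example, you could add some kind of "document type" identifier to the start of every file you care about, or state "this software assumes UTF-8 and supports nothing else", so that you do not have to make unreliable guesses in the first place.

By my calculation, 3 random bytes have a 15.8% chance of being valid UTF-8:

128^3 possible ASCII-only sequences = 2097152
2^16-2^11 possible 3-byte UTF-8 characters (assuming surrogate pairs and noncharacters are allowed) = 63488
1920 2-byte UTF-8 characters, each either before or after an ASCII character = 1920*128*2 = 491520
Divided by the number of 3-byte sequences = (2097152+63488+491520)/16777216.0 = 0.1580810546875

This greatly overestimates the number of incorrect matches, because the file is only 3 bytes long. The overlap shrinks rapidly as the number of bytes grows. Also, real text in non-UTF-8 encodings is not random: it contains many isolated bytes with the high bit set, which is invalid UTF-8.

A more useful metric for guessing the odds of failure is how likely a sequence of bytes with the high bit set is to be valid UTF-8. I get these values:

1 byte = 0% # the really important number that is often ignored
2 byte = 11.7%
3 byte = 3.03% (assumes surrogate halves are valid)
4 byte = 1.76% (includes two 2-byte characters)

It is also useful to try to find an actual readable string (in any language and any encoding) that is also a valid UTF-8 string.

// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.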
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

// Feeds one byte to the decoder; *codep holds the code point once
// the returned state is UTF8_ACCEPT.
static inline uint32_t
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  *state = utf8d[256 + *state*16 + type];
  return *state;
}

uint32_t validate_utf8(uint32_t *state, char *str, size_t len) {
   size_t i;
   uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}

char buf[128];
size_t bytes_read;
uint32_t state = UTF8_ACCEPT;

// Validate the UTF8 data in chunks.
while ((bytes_read = get_new_data(buf, sizeof(buf)))) {
    if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT) {
        fprintf(stderr, "Invalid UTF8 data!\n");
        return -1;
    }
}

// If everything went well we should have proper UTF8,
// the data might instead have ended in the middle of a UTF8
// codepoint.
if (state != UTF8_ACCEPT) {
    fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
}

#include<stdio.h>
#include<string.h>

/* UTF-8 : BYTE_BITS*/

/* B0_BYTE : 0XXXXXXX */
/* B1_BYTE : 10XXXXXX */
/* B2_BYTE : 110XXXXX */
/* B3_BYTE : 1110XXXX */
/* B4_BYTE : 11110XXX */
/* B5_BYTE : 111110XX */
/* B6_BYTE : 1111110X */

#define B0_BYTE 0x00
#define B1_BYTE 0x80
#define B2_BYTE 0xC0
#define B3_BYTE 0xE0
#define B4_BYTE 0xF0
#define B5_BYTE 0xF8
#define B6_BYTE 0xFC
#define B7_BYTE 0xFE

/* Please tune this as per number of lines input */
#define MAX_UTF8_STR 10

/* 600 is used because 6byteX100chars */
#define MAX_UTF8_CHR 600

void func_find_utf8 (char *ptr_to_str);
void print_non_ascii (int bytes, char *pbyte);

char strbuf[MAX_UTF8_STR][MAX_UTF8_CHR];

int
main (int ac, char *av[])
{
  int i = 0;
  char no_newln_str[MAX_UTF8_CHR];

  printf ("\n\nYou can enter utf-8 string or Q/q to QUIT\n\n");
  while (i < MAX_UTF8_STR)
    {
      /* Stop on end of input or read error */
      if (!fgets (strbuf[i], MAX_UTF8_CHR, stdin))
        break;
      if (!strlen (strbuf[i]))
        break;
      if ((strbuf[i][0] == 'Q') || (strbuf[i][0] == 'q'))
        break;
      strcpy (no_newln_str, strbuf[i]);
      no_newln_str[strlen (no_newln_str) - 1] = 0;
      func_find_utf8 (no_newln_str);
      ++i;
    }
  return 0;
}

/* Lengths are taken from the lead byte only; continuation bytes are not
   checked, so truncated input can advance past the terminator. The 5- and
   6-byte forms are recognised here but are not valid under RFC 3629. */
void
func_find_utf8 (char *ptr_to_str)
{
  int found_non_ascii;
  char *pbyte;

  pbyte = ptr_to_str;
  found_non_ascii = 0;
  while (*pbyte)
    {
      if ((*pbyte & B1_BYTE) == B0_BYTE)
        {
          pbyte++;
          continue;
        }
      else
        {
          found_non_ascii = 1;
          if ((*pbyte & B7_BYTE) == B6_BYTE)
            {
              print_non_ascii (6, pbyte);
              pbyte += 6;
              continue;
            }
          if ((*pbyte & B6_BYTE) == B5_BYTE)
            {
              print_non_ascii (5, pbyte);
              pbyte += 5;
              continue;
            }
          if ((*pbyte & B5_BYTE) == B4_BYTE)
            {
              print_non_ascii (4, pbyte);
              pbyte += 4;
              continue;
            }
          if ((*pbyte & B4_BYTE) == B3_BYTE)
            {
              print_non_ascii (3, pbyte);
              pbyte += 3;
              continue;
            }
          if ((*pbyte & B3_BYTE) == B2_BYTE)
            {
              print_non_ascii (2, pbyte);
              pbyte += 2;
              continue;
            }
          /* Stray continuation byte (10xxxxxx) or 0xFE/0xFF: skip it,
             otherwise the loop would never advance. */
          pbyte++;
        }
    }
  if (found_non_ascii)
    printf (" These are Non Ascii chars\n");
}

void
print_non_ascii (int bytes, char *pbyte)
{
  char store[6];
  int i;

  memset (store, 0, 6);
  memcpy (store, pbyte, bytes);
  i = 0;
  while (i < bytes)
    printf ("%c", store[i++]);
  printf ("%c", ' ');
  fflush (stdout);
}

#include <string.h>

/*
** Checks if the given string has all bytes like: 10xxxxxx
** where x is either 0 or 1
*/
static int chars_are_folow_uni(const unsigned char *chars)
{
    while (*chars)
    {
        if ((*chars >> 6) != 0x2)
            return (0);
        chars++;
    }
    return (1);
}

/* Checks whether key holds exactly one UTF-8 encoded character. */
int char_is_utf8(const unsigned char *key)
{
    int required_len;

    if (key[0] >> 7 == 0)
        required_len = 1;
    else if (key[0] >> 5 == 0x6)
        required_len = 2;
    else if (key[0] >> 4 == 0xE)
        required_len = 3;
    else if (key[0] >> 3 == 0x1E)
        required_len = 4;
    else
        return (0);
    return (strlen((const char *)key) == (size_t)required_len
            && chars_are_folow_uni(key + 1));
}

/* ft_to_utf8 is not defined in this thread */
unsigned char buf[5];
ft_to_utf8(L'歓', buf);
printf("%d\n", char_is_utf8(buf)); // => 1

#include <stddef.h>
#include <stdint.h>

/**
 * Maps the last 5 bits in a byte (0b11111xxx) to a UTF-8 codepoint length.
 *
 * Codepoint length 0 == error.
 *
 * The first valid length can be any value between 1 to 4 (5== error).
 *
 * An intermediate (second, third or fourth) valid length must be 5.
 *
 * The map was populated using the following Ruby script:
 *
 *      map = []; 32.times { map << 0 }; (0..0b1111).each {|i| map[i] = 1} ;
 *      (0b10000..0b10111).each {|i| map[i] = 5} ;
 *      (0b11000..0b11011).each {|i| map[i] = 2} ;
 *      (0b11100..0b11101).each {|i| map[i] = 3} ;
 *      map[0b11110] = 4; map;
 */
static uint8_t fio_str_utf8_map[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                     1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5,
                                     5, 5, 2, 2, 2, 2, 3, 3, 4, 0};

/**
 * Advances the `ptr` by one utf-8 character, placing the value of the UTF-8
 * character into the i32 variable (which must be a signed integer with 32bits
 * or more). On error, `i32` will be equal to `-1` and `ptr` will not step
 * forwards.
 *
 * The `end` value is only used for overflow protection.
 */
#define FIO_STR_UTF8_CODE_POINT(ptr, end, i32) \
  switch (fio_str_utf8_map[((uint8_t *)(ptr))[0] >> 3]) { \
  case 1: \
    (i32) = ((uint8_t *)(ptr))[0]; \
    ++(ptr); \
    break; \
  case 2: \
    if (((ptr) + 2 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = \
        ((((uint8_t *)(ptr))[0] & 31) << 6) | (((uint8_t *)(ptr))[1] & 63); \
    (ptr) += 2; \
    break; \
  case 3: \
    if (((ptr) + 3 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = ((((uint8_t *)(ptr))[0] & 15) << 12) | \
            ((((uint8_t *)(ptr))[1] & 63) << 6) | \
            (((uint8_t *)(ptr))[2] & 63); \
    (ptr) += 3; \
    break; \
  case 4: \
    if (((ptr) + 4 > (end)) || \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5 || \
        fio_str_utf8_map[((uint8_t *)(ptr))[3] >> 3] != 5) { \
      (i32) = -1; \
      break; \
    } \
    (i32) = ((((uint8_t *)(ptr))[0] & 7) << 18) | \
            ((((uint8_t *)(ptr))[1] & 63) << 12) | \
            ((((uint8_t *)(ptr))[2] & 63) << 6) | \
            (((uint8_t *)(ptr))[3] & 63); \
    (ptr) += 4; \
    break; \
  default: \
    (i32) = -1; \
    break; \
  }

/** Returns 1 if the String is UTF-8 valid and 0 if not. */
inline static size_t fio_str_utf8_valid2(char const *str, size_t length) {
  if (!str)
    return 0;
  if (!length)
    return 1;
  const char *const end = str + length;
  int32_t c = 0;
  do {
    FIO_STR_UTF8_CODE_POINT(str, end, c);
  } while (c > 0 && str < end);
  return str == end && c >= 0;
}

#include <stdio.h>
#include <stdlib.h>

#define __FALSE (0)
#define __TRUE  (!__FALSE)

#define MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR        (0)
#define MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR        (1)
#define MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR     (2)
#define MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS    (3)
#define MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS    (4)

typedef int __BOOL;

/* Counts the leading 1 bits of the current byte and advances *p by one. */
int CountMS1BitSequenceAndForward(const char **p) {
    int     Mask;
    int     Result = 0;
    char    c = **p;
    ++(*p);
    for (Mask=0x80;c&(Mask&0xFF);Mask>>=1,++Result);
    return Result;
}

int MS1BitSequenceCount2SuccessorByteCount(int MS1BitSeqCount) {
    switch (MS1BitSeqCount) {
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR: return 1;
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS: return 2;
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS: return 3;
    }
    return 0;
}

__BOOL ExpectUTF8SuccessorCharsOrReturnFalse(const char **Str, int NumberOfCharsToExpect) {
    while (NumberOfCharsToExpect--) {
        if (CountMS1BitSequenceAndForward(Str) != MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR) {
            return __FALSE;
        }
    }
    return __TRUE;
}

__BOOL IsMS1BitSequenceCountAValidUTF8Starter(int Number) {
    switch (Number) {
    case MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR:
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR:
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS:
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS:
        return __TRUE;
    }
    return __FALSE;
}

#define NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8       (-1)
#define NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx        (-1)

int CountValidUTF8CharactersOrNegativeOnBadUTF8(const char *Str) {
    int NumberOfValidUTF8Sequences = 0;
    if (!Str || !Str[0]) { return 0; }
    while (*Str) {
        int MS1BitSeqCount = CountMS1BitSequenceAndForward(&Str);
        if (!IsMS1BitSequenceCountAValidUTF8Starter(MS1BitSeqCount)) {
            return NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8;
        }
        if (!ExpectUTF8SuccessorCharsOrReturnFalse(&Str, MS1BitSequenceCount2SuccessorByteCount(MS1BitSeqCount))) {
            return NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx;
        }
        if (MS1BitSeqCount) { ++NumberOfValidUTF8Sequences; }
    }
    return NumberOfValidUTF8Sequences;
}

static void TestUTF8CheckOrDie(const char *Str, int ExpectedResult) {
    int Result = CountValidUTF8CharactersOrNegativeOnBadUTF8(Str);
    if (Result != ExpectedResult) {
        printf("TEST FAILED: %s:%i: check on '%s' returned %i, but expected was %i\n",
               __FILE__, __LINE__, Str ? Str : "(null)", Result, ExpectedResult);
        exit(1);
    }
}

void SimpleUTF8TestCases(void) {
    TestUTF8CheckOrDie("abcd89234", 0);  // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie("", 0);           // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie(NULL, 0);
    TestUTF8CheckOrDie("asdföadkg", 1);  // contains one valid UTF8 character sequence
    TestUTF8CheckOrDie("asdföadäkg", 2); // contains two valid UTF8 character sequences
    TestUTF8CheckOrDie("asdf\xF8" "adäkg", -1); // contains at least one invalid UTF8 sequence
}