
How do I detect UTF-8 in plain C?


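I'm looking for a code snippet in plain old C that detects whether a given string is UTF-8 encoded. I know the regex-based solution, but for various reasons it would be better to avoid anything other than plain C in this particular case.

The regex-based solution looks like this (warning: various checks omitted):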


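You have to parse the string as UTF-8; the format is very simple. If the parse fails, it is not UTF-8. Several simple UTF-8 libraries can do this.

It can perhaps be simplified if you know the string is either plain old ASCII or contains characters outside ASCII that are UTF-8 encoded. In that case you often don't need to care about the difference: UTF-8 was designed so that existing programs that can handle ASCII can, in most cases, handle UTF-8 transparently.

Keep in mind that ASCII is encoded in UTF-8 as itself, so ASCII is valid UTF-8.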
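To illustrate the point that pure ASCII needs no special handling, here is a minimal sketch of an ASCII-only check. The name `is_ascii` is made up for this example; it is not part of any library mentioned here.

```c
/* Hypothetical helper: returns 1 if every byte of the NUL-terminated
   string is 7-bit ASCII (and therefore also valid UTF-8), 0 otherwise. */
int is_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (!s)
        return 0;
    for (; *p; p++) {
        if (*p > 0x7F)      /* high bit set: not ASCII */
            return 0;
    }
    return 1;
}
```

A string that passes this check can be treated as UTF-8 without further validation; one that fails still needs a real UTF-8 parse.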

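A C string can be anything. Is the problem you need to solve that you don't know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the other dozen character encodings that make a program's life hard?

You can use the UTF-8 detector that ships with Firefox. It is found in the universal charset detector, which is pretty much a standalone C++ library. It should be very easy to find the class that recognizes UTF-8 and take only that.
What this class basically does is detect character sequences that are unique to UTF-8.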

  • get the latest firefox trunk
  • go to \mozilla\extensions\universalchardet\
  • find the UTF-8 detector class (I don't quite remember its exact name)

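You cannot detect whether a given string (or byte sequence) is UTF-8 encoded text: for example, every UTF-8 octet sequence is also a valid (if nonsensical) Latin-1 (or some other encoding) octet sequence. However, not every valid Latin-1 octet sequence is a valid UTF-8 sequence. So you can rule out strings that do not conform to the UTF-8 encoding schema: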

U+0000-U+007F    0xxxxxxx
U+0080-U+07FF    110yyyxx    10xxxxxx
U+0800-U+FFFF    1110yyyy    10yyyyxx    10xxxxxx
U+10000-U+10FFFF 11110zzz    10zzyyyy    10yyyyxx    10xxxxxx   

Here is a (hopefully bug-free) implementation in plain C:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
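
The encoding table above can also be read as a lead-byte classifier. The following is only a sketch of that reading; the function name `utf8_expected_length` is an assumption made for illustration. It returns the sequence length implied by the first byte and says nothing about whether the following bytes are valid.

```c
/* Hypothetical helper: number of bytes in the sequence that starts with
   `lead`, or 0 if `lead` cannot start a well-formed UTF-8 sequence. */
int utf8_expected_length(unsigned char lead)
{
    if (lead <= 0x7F)
        return 1;                       /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                       /* 110yyyxx, overlong 0xC0/0xC1 excluded */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                       /* 1110yyyy */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                       /* 11110zzz, up to U+10FFFF */
    return 0;   /* continuation byte (0x80..0xBF), 0xC0, 0xC1, or 0xF5..0xFF */
}
```

The second-byte restrictions (for example 0xA0..0xBF after 0xE0) still have to be checked separately, as the full implementation does.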
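This decoder by Bjoern Hoehrmann is the simplest one I've found. It also works by feeding it a single byte at a time while keeping a state. The state is very useful for parsing UTF8 that arrives in chunks over the network.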

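If the text is valid UTF-8, UTF8_ACCEPT is returned. If it is invalid, UTF8_REJECT is returned; any other integer means more data is needed.

Usage example where the data is fed in chunks (e.g. from the network):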


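It is impossible to detect that a given array of bytes is a UTF-8 string. You can reliably determine that it cannot be valid UTF-8 (which does not mean it isn't invalid UTF-8), and you can determine that it might be a valid UTF-8 sequence, but this is subject to false positives.

For a simple example, use a random number generator to generate an array of 3 random bytes and test your code with it. These are random bytes and therefore not UTF-8, so every string your code considers "possibly UTF-8" is a false positive. My guess is that (under these conditions) your code will be wrong about 12% of the time.

Once you recognize that it is impossible, you can start thinking about returning a confidence level (in addition to your prediction). For example, your function might return something like "I'm 88% sure this is UTF-8".

Now do this for all other types of data. For example, you might have a function that checks whether the data is UTF-16, which might return "I'm 95% sure this is UTF-16", and then decide (because 95% is higher than 88%) that the data is more likely UTF-16 than UTF-8.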
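
As a rough sketch of how evidence for such a confidence level could be gathered: a buffer that validates and contains many multi-byte sequences is much stronger evidence than one that merely validates. The names `utf8_seq_len` and `utf8_evidence` are hypothetical, and how the count is mapped to a percentage is left open on purpose.

```c
#include <stddef.h>

/* Length (1..4) of the well-formed UTF-8 sequence at s, or 0 if malformed. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char b = s[0];
    size_t len, i;

    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) len = 2;
    else if (b >= 0xE0 && b <= 0xEF) len = 3;
    else if (b >= 0xF0 && b <= 0xF4) len = 4;
    else return 0;
    if (len > n) return 0;                       /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;     /* not 10xxxxxx */
    if (b == 0xE0 && s[1] < 0xA0) return 0;      /* overlong 3-byte form */
    if (b == 0xED && s[1] > 0x9F) return 0;      /* surrogate range */
    if (b == 0xF0 && s[1] < 0x90) return 0;      /* overlong 4-byte form */
    if (b == 0xF4 && s[1] > 0x8F) return 0;      /* above U+10FFFF */
    return len;
}

/* Hypothetical helper: returns 0 if buf is definitely not UTF-8.
   Otherwise returns 1 and stores the number of multi-byte sequences
   seen in *multibyte (0 means the buffer is pure ASCII). */
int utf8_evidence(const unsigned char *buf, size_t len, size_t *multibyte)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            return 0;
        if (n > 1)
            count++;
        i += n;
    }
    *multibyte = count;
    return 1;
}
```

A caller could treat a result of 1 with a count of 0 as "ASCII, no evidence either way" and raise its confidence as the count grows.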

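The next step is to add tricks that increase your confidence. For example, if the string seems to consist mostly of groups of valid syllables separated by whitespace, you can be much more confident that it really is UTF-8. Likewise, if the data might be HTML, you can check for things that could be valid HTML markup and use that to increase your confidence.

Of course, the same applies to other types of data. For example, if the data has a valid PE32 or ELF header, or a correct BMP or JPG or MP3 header, you can be much more confident that it is not UTF-8 at all.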
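
A minimal sketch of such a header check, assuming only the well-known leading signature bytes of each format. The function name `has_binary_magic` is invented for this example, and a real detector would inspect more than the first few bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the signature of a
   common binary format, 0 otherwise. */
int has_binary_magic(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "\x7F" "ELF", 4) == 0)
        return 1;                               /* ELF executable */
    if (len >= 2 && memcmp(buf, "MZ", 2) == 0)
        return 1;                               /* DOS header that precedes PE32 */
    if (len >= 2 && memcmp(buf, "BM", 2) == 0)
        return 1;                               /* BMP image */
    if (len >= 3 && memcmp(buf, "\xFF\xD8\xFF", 3) == 0)
        return 1;                               /* JPEG image */
    if (len >= 3 && memcmp(buf, "ID3", 3) == 0)
        return 1;                               /* MP3 with an ID3v2 tag */
    return 0;
}
```

Note that "MZ", "BM" and "ID3" are themselves valid ASCII, so a match on those lowers the confidence in "this is text" rather than ruling it out.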


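A better approach is to fix the actual cause of the problem. For example, you could add some kind of "document type" identifier at the start of all files you care about, or state "this software assumes UTF-8 and does not support anything else", so that you never have to make an unreliable guess in the first place.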

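By my calculation, 3 random bytes have about a 15.8% chance of being valid UTF-8:

128^3 possible ASCII-only sequences = 2097152

2^16-2^11 possible 3-byte UTF-8 characters (assuming surrogate pairs and noncharacters are allowed) = 63488

1920 2-byte UTF-8 characters either before or after an ASCII character = 1920*128*2 = 491520

Divide by the number of 3-byte sequences = (2097152+63488+491520)/16777216.0 = 0.1580810546875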
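
The count above can be cross-checked by brute force over all 2^24 three-byte sequences. This is a sketch under the same assumptions as the calculation: overlong forms are rejected, surrogates and noncharacters are allowed, and the byte 0x00 counts as ASCII. Both function names are made up for this example.

```c
#include <stddef.h>

/* Lenient validator matching the assumptions stated above. */
static int is_utf8_lenient(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        size_t len, k;

        if (b < 0x80) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF) len = 2;
        else if (b >= 0xE0 && b <= 0xEF) len = 3;
        else if (b >= 0xF0 && b <= 0xF4) len = 4;
        else return 0;
        if (len > n - i) return 0;                    /* truncated */
        for (k = 1; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not 10xxxxxx */
        if (b == 0xE0 && s[i + 1] < 0xA0) return 0;   /* overlong */
        if (b == 0xF0 && s[i + 1] < 0x90) return 0;   /* overlong */
        if (b == 0xF4 && s[i + 1] > 0x8F) return 0;   /* above U+10FFFF */
        i += len;
    }
    return 1;
}

/* Counts how many of the 16777216 possible 3-byte buffers validate. */
unsigned long count_valid_3byte_sequences(void)
{
    unsigned long count = 0, v;

    for (v = 0; v < 0x1000000UL; v++) {
        unsigned char s[3];
        s[0] = (unsigned char)(v >> 16);
        s[1] = (unsigned char)(v >> 8);
        s[2] = (unsigned char)v;
        count += (unsigned long)is_utf8_lenient(s, 3);
    }
    return count;
}
```

Under these assumptions the function returns 2652160, which is 2097152 + 63488 + 491520, and dividing by 16777216 gives the 15.8% figure.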

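This greatly overestimates the number of incorrect matches, because the file is only 3 bytes long. The intersection shrinks as the number of bytes grows. Also, real text in a non-UTF-8 encoding is not random: it contains a large number of isolated bytes with the high bit set, which is not valid UTF-8.

A more useful metric for guessing the odds of failure is how likely a sequence of bytes with the high bit set is to be valid UTF-8. I get the following values: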

1 byte = 0% # the really important number that is often ignored
2 byte = 11.7%
3 byte = 3.03% (assumes surrogate halves are valid)
4 byte = 1.76% (includes two 2-byte characters)

It is also useful to try to find an actual readable string (in any language and any encoding) that is also a valid UTF-8 string.
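
The percentages in the table above can be reproduced with a little arithmetic. This sketch assumes, as the table does, that surrogate halves count as valid 3-byte characters and that the 4-byte case includes two consecutive 2-byte characters. The function name `high_bit_valid_fraction` is hypothetical.

```c
/* Number of well-formed UTF-8 characters by encoded length. */
static const double CHARS2 = 30.0 * 64.0;          /* lead 0xC2..0xDF: 1920 */
static const double CHARS3 = 65536.0 - 2048.0;     /* U+0800..U+FFFF: 63488 */
static const double CHARS4 = 1114112.0 - 65536.0;  /* U+10000..U+10FFFF: 1048576 */

/* Fraction of all n-byte sequences with every high bit set that are valid
   UTF-8, for n = 1..4. Returns -1.0 for any other n. */
double high_bit_valid_fraction(int n)
{
    double total = 1.0, valid;
    int i;

    for (i = 0; i < n; i++)
        total *= 128.0;                 /* 128 byte values have the high bit set */

    switch (n) {
    case 1: valid = 0.0; break;         /* a lone high byte is never valid */
    case 2: valid = CHARS2; break;
    case 3: valid = CHARS3; break;
    case 4: valid = CHARS4 + CHARS2 * CHARS2; break;
    default: return -1.0;
    }
    return valid / total;
}
```

For n = 2, 3 and 4 this gives about 0.117, 0.0303 and 0.0176, matching the table.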
// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

static inline uint32_t
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  *state = utf8d[256 + *state*16 + type];
  return *state;
}
#include <stddef.h>  /* size_t */

uint32_t validate_utf8(uint32_t *state, const char *str, size_t len) {
    size_t i;
    uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}
char buf[128];
size_t bytes_read;
uint32_t state = UTF8_ACCEPT;

// Validate the UTF8 data in chunks.
while ((bytes_read = get_new_data(buf, sizeof(buf)))) {
    if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT) {
        fprintf(stderr, "Invalid UTF8 data!\n");
        return -1;
    }
}

// If everything went well we should have proper UTF8,
// the data might instead have ended in the middle of a UTF8
// codepoint.
if (state != UTF8_ACCEPT) {
    fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
}
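
To illustrate why the carried state matters at chunk boundaries, here is a minimal sketch that uses a much simpler state than the DFA above: just the number of continuation bytes still expected. This is a swapped-in technique for illustration, not Hoehrmann's decoder. It does not reject overlong forms or surrogates, and the names `chunk_state` and `feed_chunk` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical simplified state: continuation bytes still expected. */
typedef struct {
    int pending;
} chunk_state;

/* Feeds one chunk. Returns 0 on a definite error and 1 otherwise.
   After the last chunk, st->pending != 0 means the input ended in the
   middle of a character. */
int feed_chunk(chunk_state *st, const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (st->pending > 0) {
            if ((b & 0xC0) != 0x80)
                return 0;               /* expected 10xxxxxx */
            st->pending--;
        } else if (b < 0x80) {
            /* ASCII: nothing pending */
        } else if ((b & 0xE0) == 0xC0) {
            st->pending = 1;            /* 110xxxxx */
        } else if ((b & 0xF0) == 0xE0) {
            st->pending = 2;            /* 1110xxxx */
        } else if ((b & 0xF8) == 0xF0) {
            st->pending = 3;            /* 11110xxx */
        } else {
            return 0;                   /* stray continuation or invalid lead */
        }
    }
    return 1;
}
```

Feeding the bytes E2 82 in one chunk and AC in the next leaves `pending` at 1 after the first call and at 0 after the second, so a character split across two reads is still accepted.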
#include <stdio.h>
#include <string.h>

/* UTF-8 : BYTE_BITS*/
/* B0_BYTE : 0XXXXXXX */
/* B1_BYTE : 10XXXXXX */
/* B2_BYTE : 110XXXXX */
/* B3_BYTE : 1110XXXX */
/* B4_BYTE : 11110XXX */
/* B5_BYTE : 111110XX */
/* B6_BYTE : 1111110X */
#define B0_BYTE 0x00
#define B1_BYTE 0x80
#define B2_BYTE 0xC0
#define B3_BYTE 0xE0
#define B4_BYTE 0xF0
#define B5_BYTE 0xF8
#define B6_BYTE 0xFC
#define B7_BYTE 0xFE

/* Please tune this as per number of lines input */
#define MAX_UTF8_STR 10

/* 600 is used because 6byteX100chars */
#define MAX_UTF8_CHR 600

void func_find_utf8 (char *ptr_to_str);
void print_non_ascii (int bytes, char *pbyte);

char strbuf[MAX_UTF8_STR][MAX_UTF8_CHR];

int
main (int ac, char *av[])
{
  int i = 0;
  char no_newln_str[MAX_UTF8_CHR];

  printf ("\n\nYou can enter utf-8 string or Q/q to QUIT\n\n");

  while (i < MAX_UTF8_STR)
    {
      /* Stop on end of input or read error */
      if (!fgets (strbuf[i], MAX_UTF8_CHR, stdin))
        break;
      if (!strlen (strbuf[i]))
        break;
      if ((strbuf[i][0] == 'Q') || (strbuf[i][0] == 'q'))
        break;

      strcpy (no_newln_str, strbuf[i]);
      /* Strip the trailing newline, if there is one */
      no_newln_str[strcspn (no_newln_str, "\n")] = 0;

      func_find_utf8 (no_newln_str);
      ++i;
    }

  return 0;
}

void
func_find_utf8 (char *ptr_to_str)
{
  int found_non_ascii = 0;
  int bytes;
  char *pbyte = ptr_to_str;

  while (*pbyte)
    {
      if ((*pbyte & B1_BYTE) == B0_BYTE)
        {
          pbyte++;
          continue;
        }

      found_non_ascii = 1;

      /* Sequence length according to the lead byte */
      if ((*pbyte & B7_BYTE) == B6_BYTE)
        bytes = 6;
      else if ((*pbyte & B6_BYTE) == B5_BYTE)
        bytes = 5;
      else if ((*pbyte & B5_BYTE) == B4_BYTE)
        bytes = 4;
      else if ((*pbyte & B4_BYTE) == B3_BYTE)
        bytes = 3;
      else if ((*pbyte & B3_BYTE) == B2_BYTE)
        bytes = 2;
      else
        {
          /* Stray continuation byte or 0xFE/0xFF: skip it,
             otherwise the loop would never advance */
          pbyte++;
          continue;
        }

      /* Do not run past the terminating NUL on a truncated sequence */
      if ((size_t) bytes > strlen (pbyte))
        bytes = (int) strlen (pbyte);

      print_non_ascii (bytes, pbyte);
      pbyte += bytes;
    }

  if (found_non_ascii)
    printf (" These are non-ASCII chars\n");
}

void
print_non_ascii (int bytes, char *pbyte)
{
  char store[6];
  int i = 0;

  memset (store, 0, 6);
  memcpy (store, pbyte, bytes);

  while (i < bytes)
    printf ("%c", store[i++]);
  printf ("%c", ' ');
  fflush (stdout);
}
#include <string.h>

/*
** Checks if the given string has all bytes like: 10xxxxxx
** where x is either 0 or 1
*/

static int      chars_are_folow_uni(const unsigned char *chars)
{
    while (*chars)
    {
        if ((*chars >> 6) != 0x2)
            return (0);
        chars++;
    }
    return (1);
}

int             char_is_utf8(const unsigned char *key)
{
    size_t      required_len;

    if (key[0] >> 7 == 0)
        required_len = 1;
    else if (key[0] >> 5 == 0x6)
        required_len = 2;
    else if (key[0] >> 4 == 0xE)
        required_len = 3;
    else if (key[0] >> 3 == 0x1E)   /* 11110xxx: shift by 3, not 5 */
        required_len = 4;
    else
        return (0);
    return (strlen((const char *)key) == required_len && chars_are_folow_uni(key + 1));
}
unsigned char   buf[5];

ft_to_utf8(L'歓', buf);
printf("%d\n", char_is_utf8(buf)); // => 1
#include <stddef.h>
#include <stdint.h>
/**
 * Maps the top 5 bits in a byte (0b11111xxx) to a UTF-8 codepoint length.
 *
 * Codepoint length 0 == error.
 *
 * The first valid length can be any value between 1 and 4 (5 == error).
 *
 * An intermediate (second, third or fourth) valid length must be 5.
 *
 * The map was populated using the following Ruby script:
 *
 *      map = []; 32.times { map << 0 }; (0..0b1111).each {|i| map[i] = 1} ;
 *      (0b10000..0b10111).each {|i| map[i] = 5} ;
 *      (0b11000..0b11011).each {|i| map[i] = 2} ;
 *      (0b11100..0b11101).each {|i| map[i] = 3} ;
 *      map[0b11110] = 4; map;
 */
static uint8_t fio_str_utf8_map[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                     1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5,
                                     5, 5, 2, 2, 2, 2, 3, 3, 4, 0};

/**
 * Advances the `ptr` by one utf-8 character, placing the value of the UTF-8
 * character into the i32 variable (which must be a signed integer with 32bits
 * or more). On error, `i32` will be equal to `-1` and `ptr` will not step
 * forwards.
 *
 * The `end` value is only used for overflow protection.
 */
#define FIO_STR_UTF8_CODE_POINT(ptr, end, i32)                                 \
  switch (fio_str_utf8_map[((uint8_t *)(ptr))[0] >> 3]) {                      \
  case 1:                                                                      \
    (i32) = ((uint8_t *)(ptr))[0];                                             \
    ++(ptr);                                                                   \
    break;                                                                     \
  case 2:                                                                      \
    if (((ptr) + 2 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) =                                                                    \
        ((((uint8_t *)(ptr))[0] & 31) << 6) | (((uint8_t *)(ptr))[1] & 63);    \
    (ptr) += 2;                                                                \
    break;                                                                     \
  case 3:                                                                      \
    if (((ptr) + 3 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) = ((((uint8_t *)(ptr))[0] & 15) << 12) |                             \
            ((((uint8_t *)(ptr))[1] & 63) << 6) |                              \
            (((uint8_t *)(ptr))[2] & 63);                                      \
    (ptr) += 3;                                                                \
    break;                                                                     \
  case 4:                                                                      \
    if (((ptr) + 4 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[3] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) = ((((uint8_t *)(ptr))[0] & 7) << 18) |                              \
            ((((uint8_t *)(ptr))[1] & 63) << 12) |                             \
            ((((uint8_t *)(ptr))[2] & 63) << 6) |                              \
            (((uint8_t *)(ptr))[3] & 63);                                      \
    (ptr) += 4;                                                                \
    break;                                                                     \
  default:                                                                     \
    (i32) = -1;                                                                \
    break;                                                                     \
  }

/** Returns 1 if the String is UTF-8 valid and 0 if not. */
inline static size_t fio_str_utf8_valid2(char const *str, size_t length) {
  if (!str)
    return 0;
  if (!length)
    return 1;
  const char *const end = str + length;
  int32_t c = 0;
  do {
    FIO_STR_UTF8_CODE_POINT(str, end, c);
  } while (c > 0 && str < end);
  return str == end && c >= 0;
}
#define __FALSE (0)
#define __TRUE  (!__FALSE)

#define MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR        (0)
#define MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR        (1)
#define MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR     (2)
#define MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS    (3)
#define MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS    (4)

typedef int __BOOL;

int CountMS1BitSequenceAndForward(const char **p) {
    int     Mask;
    int     Result = 0;
    char    c = **p;
    ++(*p);
    for (Mask=0x80;c&(Mask&0xFF);Mask>>=1,++Result);
    return Result;
}


int MS1BitSequenceCount2SuccessorByteCount(int MS1BitSeqCount) {
    switch (MS1BitSeqCount) {
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR: return 1;
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS: return 2;
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS: return 3;
    }
    return 0;
}

__BOOL ExpectUTF8SuccessorCharsOrReturnFalse(const char **Str, int NumberOfCharsToExpect) {
    while (NumberOfCharsToExpect--) {
        if (CountMS1BitSequenceAndForward(Str) != MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR) {
            return __FALSE;
        }
    }
    return __TRUE;
}

__BOOL IsMS1BitSequenceCountAValidUTF8Starter(int Number) {
    switch (Number) {
    case MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR:
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR:
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS:
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS:
        return __TRUE;
    }
    return __FALSE;
}

#define NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8       (-1)
#define NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx        (-1)

int CountValidUTF8CharactersOrNegativeOnBadUTF8(const char *Str) {
    int NumberOfValidUTF8Sequences = 0;
    if (!Str || !Str[0]) { return 0; }
    while (*Str) {
        int MS1BitSeqCount = CountMS1BitSequenceAndForward(&Str);
        if (!IsMS1BitSequenceCountAValidUTF8Starter(MS1BitSeqCount)) {
            return NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8;
        }
        if (!ExpectUTF8SuccessorCharsOrReturnFalse(&Str, MS1BitSequenceCount2SuccessorByteCount(MS1BitSeqCount))) {
            return NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx;
        }
        if (MS1BitSeqCount) { ++NumberOfValidUTF8Sequences; }
    }
    return NumberOfValidUTF8Sequences;
}
#include <stdio.h>
#include <stdlib.h>

static void TestUTF8CheckOrDie(const char *Str, int ExpectedResult) {
    int Result = CountValidUTF8CharactersOrNegativeOnBadUTF8(Str);
    if (Result != ExpectedResult) {
        printf("TEST FAILED: %s:%i: check on '%s' returned %i, but expected was %i\n", __FILE__, __LINE__, Str, Result, ExpectedResult);
        exit(1);
    }
}

void SimpleUTF8TestCases(void) {
    TestUTF8CheckOrDie("abcd89234", 0);  // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie("", 0);           // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie(NULL, 0);
    TestUTF8CheckOrDie("asdföadkg", 1);  // contains one valid UTF8 character sequence
    TestUTF8CheckOrDie("asdföadäkg", 2); // contains two valid UTF8 character sequences
    TestUTF8CheckOrDie("asdf\xF8" "adäkg", -1); // contains at least one invalid UTF8 sequence
}