PHP mb_strlen with a string that has chr(241) appended

php encoding utf-8

I ran into this problem and have simplified it as far as I can:

$test = 'XXX' . chr(241) . 'XXX';
print($test); // XXX�XXX
print(mb_strlen($test, 'UTF-8')); // 4
print(count(str_split($test))); // 7

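So basically my question is: why doesn't chr(241) count as a single character, giving a string length of 7? There are six characters, I add one, and I get four? And why isn't chr(241) equal to HTML entity 241 (&#241;)?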

Additional information is listed below. Note that everything behaves as expected as long as no X is added after chr(241):

print(mb_detect_encoding($test)); // UTF-8
print(mb_strlen('XX' . chr(241) . 'XX', 'UTF-8')); // 3
print(mb_strlen('X' . chr(241) . 'X', 'UTF-8')); // 2
print(mb_strlen('' . chr(241) . 'X', 'UTF-8')); // 1
print(mb_strlen('X' . chr(241) . '', 'UTF-8')); // 2
print(mb_strlen('XXX' . chr(241) . '', 'UTF-8')); // 4
print(mb_strlen(chr(241), 'UTF-8')); // 1

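This looks like an encoding problem, but how do I fix it? The file is saved as UTF-8, the internal encoding is UTF-8, and I am not passing the data through anything that could mangle it.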

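To get a feel for it: in UTF-8, all ASCII characters (code points up to 127) are represented by a single byte (binary form 0xxxxxxx), while code points greater than 127 are represented by multi-byte sequences. A multi-byte sequence consists of one leading byte followed by one or more continuation bytes.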
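The high-order bits of the leading byte tell us how many continuation bytes follow. For that purpose the leading byte starts with two or more 1 bits followed by a 0, i.e. its high bits are 110, 1110 or 11110. (The original UTF-8 design also allowed longer prefixes such as 111110, but RFC 3629 limits sequences to four bytes, so those are no longer valid.) The number of leading 1 bits equals the total length of the sequence, leading byte plus continuation bytes, i.e.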

110 means 1 leading byte + 1 continuation byte
1110 means 1 leading byte + 2 continuation bytes
11110 means 1 leading byte + 3 continuation bytes
The continuation bytes that follow a leading byte have the form 10xxxxxx.
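The bit patterns above can be sketched as a small lookup. This is only an illustration: `utf8SequenceLength()` is a hypothetical helper written for this answer, not a PHP or mbstring function, and it deliberately checks nothing but the prefix of a single byte.

```php
<?php

/**
 * Hypothetical helper: return the total sequence length (in bytes) that a
 * UTF-8 leading byte announces, or 0 if the byte cannot start a sequence
 * (a continuation byte, or a prefix longer than 11110).
 */
function utf8SequenceLength(int $byte): int
{
    if ($byte < 0x80) {
        return 1; // 0xxxxxxx: plain ASCII, one byte
    }
    if (($byte & 0xE0) === 0xC0) {
        return 2; // 110xxxxx: leading byte + 1 continuation byte
    }
    if (($byte & 0xF0) === 0xE0) {
        return 3; // 1110xxxx: leading byte + 2 continuation bytes
    }
    if (($byte & 0xF8) === 0xF0) {
        return 4; // 11110xxx: leading byte + 3 continuation bytes
    }
    return 0;     // 10xxxxxx (continuation) or 11111xxx (not valid UTF-8)
}

printf("%08b -> %d\n", 241, utf8SequenceLength(241));           // 11110001 -> 4
printf("%08b -> %d\n", ord('X'), utf8SequenceLength(ord('X'))); // 01011000 -> 1
```

Note that the helper only reports what the leading byte promises. Whether the following bytes really are continuation bytes is a separate check.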
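Applying the above to the $test string:
First we have three 'X' bytes (ord('X') is 88). They are all ASCII characters below 128, so each byte counts as one character.
Then comes chr(241), whose binary representation is 11110001. It is a leading byte, because it starts with two or more 1 bits.
Since it starts with four 1 bits, it announces a code point made up of 1 leading byte plus 3 continuation bytes. The three 'X' bytes that remain in the string are therefore treated by mb_strlen() as continuation bytes*: together with chr(241) that is four bytes in total, but they are counted as a single UTF-8 code point. Hence 3 + 1 = 4.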
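*Strictly speaking, those trailing 'X' bytes are not valid continuation bytes, because they do not match the 10xxxxxx pattern. Nevertheless, mb_strlen() consumes 3 more bytes after chr(241), as described above. You can test this by adding another 'X' to the end of the $test string, or by removing one from it.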
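The counting described above can be modelled in plain PHP. The sketch below is an assumption about the behaviour, not mbstring's actual code: `naiveUtf8Length()` is a hypothetical function that trusts each leading byte and skips the announced number of bytes without validating them. It reproduces the numbers reported in the question; the real mb_strlen() may treat invalid input differently depending on the PHP version.

```php
<?php

/**
 * Hypothetical model of a length count that is driven only by leading bytes.
 * Continuation bytes are skipped without being checked.
 */
function naiveUtf8Length(string $bytes): int
{
    $count = 0;
    $i = 0;
    $n = strlen($bytes);

    while ($i < $n) {
        $b = ord($bytes[$i]);
        if ($b < 0xC0) {
            $step = 1; // ASCII, or a stray continuation byte
        } elseif ($b < 0xE0) {
            $step = 2; // 110xxxxx
        } elseif ($b < 0xF0) {
            $step = 3; // 1110xxxx
        } elseif ($b < 0xF8) {
            $step = 4; // 11110xxx
        } else {
            $step = 1; // 11111xxx: not a valid leading byte, count it alone
        }
        $i += $step;   // may run past the end of a truncated string
        $count++;
    }

    return $count;
}

echo naiveUtf8Length('XXX' . chr(241) . 'XXX'), PHP_EOL; // 4
echo naiveUtf8Length('XX' . chr(241) . 'XX'), PHP_EOL;   // 3
echo naiveUtf8Length('XXX' . chr(241)), PHP_EOL;         // 4
echo strlen('XXX' . chr(241) . 'XXX'), PHP_EOL;          // 7 (bytes)
```

Under this model, appending a fourth 'X' after chr(241) raises the count to 5, because the extra byte falls outside the four-byte sequence.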
Update: verification results:

/*
 * The following strings are not valid UTF-8 encodings.
 * We test whether mb_strlen() consumes invalid UTF-8 byte strings
 * as if they were valid (driven by the leading bytes).
 */

/*
 * 0xc0 as a leading byte should consume one continuation byte,
 * so the length reported should be 6.
 */
$test = 'XXX' . chr(0xc0) . 'XXX';
echo '6 == ', mb_strlen($test, 'UTF8'), PHP_EOL;

/*
 * 0xe0 as a leading byte should consume two continuation bytes,
 * so the length reported should be 5.
 */
$test = 'XXX' . chr(0xe0) . 'XXX';
echo '5 == ', mb_strlen($test, 'UTF8'), PHP_EOL;

// results in 6 == 6 and 5 == 5
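Update 2:
An example that uses chr() to build the same symbol in a single-byte encoding and in UTF-8:

$euroSignAscii = chr(0x80); // Windows-1252 "extended ASCII" (in strict ISO-8859-1, 0x80 is a control character)
$euroSignUtf8 = chr(0xe2) . chr(0x82) . chr(0xac); // UTF-8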
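Note that each of these strings is displayed correctly only when it matches the encoding of the console or web page you echo it to: $euroSignAscii is displayed correctly if that encoding is Windows-1252 (often loosely called Latin-1), and $euroSignUtf8 if it is UTF-8.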
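The same distinction answers the question about HTML entity 241. A minimal sketch, assuming PHP 7.0 or later (for the `\u{...}` escape) and the bundled PCRE extension; the variable names are illustrative only. chr(241) produces the single byte 0xF1, which is "ñ" in Latin-1, whereas the code point U+00F1 is encoded in UTF-8 as the two bytes 0xC3 0xB1.

```php
<?php

$latin1 = chr(241);  // one byte, 0xF1: "ñ" in ISO-8859-1, an unfinished sequence in UTF-8
$utf8   = "\u{F1}";  // code point U+00F1 encoded as UTF-8

echo bin2hex($latin1), PHP_EOL; // f1
echo bin2hex($utf8), PHP_EOL;   // c3b1

// Decoding the HTML entity with a UTF-8 target yields the same two bytes.
echo bin2hex(html_entity_decode('&#241;', ENT_QUOTES, 'UTF-8')), PHP_EOL; // c3b1

$fixed  = 'XXX' . $utf8 . 'XXX';   // valid UTF-8: 8 bytes, 7 code points
$broken = 'XXX' . $latin1 . 'XXX'; // 7 bytes, not valid UTF-8

echo strlen($fixed), PHP_EOL;                   // 8
echo preg_match('/^.{7}$/u', $fixed), PHP_EOL;  // 1 (exactly seven code points)

// With the /u modifier, preg_match() returns false when the subject is not valid UTF-8.
echo preg_match('//u', $broken) === false ? 'invalid' : 'valid', PHP_EOL; // invalid
```

So one way to get a length of 7 is to build the string from the UTF-8 bytes of the character rather than from chr(241).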

Links:

A good related reference is Joel Spolsky's classic post.
