PHP: UTF-8 non-breaking space (0xC2A0) and strange preg_replace behavior

php regex

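My string contains a UTF-8 non-breaking space (0xC2A0), and I want to replace it with something else.

When I use

$str=preg_replace('~\xc2\xa0~', 'X', $str);

it works fine.

But when I use

$str=preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (and not replaced).

Why? What is wrong with the second regexp? The format \x{C2A0} is correct, and I used the u flag.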

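Actually, the PHP documentation on escape sequences is wrong. With the \xc2\xa0 syntax, the pattern searches for the raw UTF-8 bytes 0xC2 0xA0. With the \x{c2a0} syntax and the u flag, the value is treated as a Unicode code point (U+C2A0), which PCRE converts to its own UTF-8 encoding, a different byte sequence.

The non-breaking space is U+00A0 in Unicode, but it is encoded as C2 A0 in UTF-8. So if you use the pattern

~\x{00a0}~siu

it will work as expected.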
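The difference between the three patterns can be checked directly. This is a minimal sketch, assuming PHP 7 or later with PCRE UTF-8 support; the variable names are illustrative and not taken from the original post.

```php
<?php
// Subject: "a", a UTF-8 non-breaking space (bytes C2 A0), "b".
$str = "a\xc2\xa0b";

// No u flag: the pattern matches the two raw bytes C2 A0.
$bytes = preg_replace('~\xc2\xa0~', 'X', $str);

// u flag with \x{C2A0}: looks for code point U+C2A0, which is not in $str.
$wrong = preg_replace('~\x{C2A0}~u', 'W', $str);

// u flag with \x{00A0}: looks for code point U+00A0, the non-breaking space.
$right = preg_replace('~\x{00A0}~u', 'W', $str);

var_dump($bytes === 'aXb');  // byte-wise replacement succeeded
var_dump($wrong === $str);   // nothing was replaced
var_dump($right === 'aWb');  // code-point replacement succeeded
```

All three comparisons should report true: only the first and third patterns describe the bytes that are actually in the string.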

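I think these two snippets do different things: the first, \xc2\xa0, replaces two bytes, \xc2 and \xa0, which together happen to be the UTF-8 encoding of the code point U+00A0.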
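To see the code point versus byte distinction outside of a regex, the encoded bytes can be dumped with bin2hex(). This sketch assumes PHP 7.0 or later, where the "\u{...}" escape in double-quoted strings produces the UTF-8 encoding of a code point; the variable names are made up for illustration.

```php
<?php
// "\u{...}" (PHP 7.0+) yields the UTF-8 bytes of the given code point.
$nbsp   = "\u{00A0}";  // NO-BREAK SPACE
$hangul = "\u{C2A0}";  // a Hangul syllable, unrelated to the non-breaking space

$nbspHex   = bin2hex($nbsp);    // hex of the bytes behind U+00A0
$hangulHex = bin2hex($hangul);  // hex of the bytes behind U+C2A0

echo $nbspHex, "\n";    // c2a0
echo $hangulHex, "\n";  // ec8aa0
var_dump(strlen($nbsp)); // int(2): one character, two bytes
```

So C2A0 is the byte sequence of U+00A0, while the code point U+C2A0 is a different character with a three-byte encoding.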

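Does \x{00A0} work? That should be the representation of \xc2\xa0.

I did not use the variant ~\x{c2a0}~siu. The variant \x{00A0} works fine.

I did not try the second option; here is the result: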

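I tried converting the string to hex and replacing the non-breaking space 0xC2 0xA0 (c2a0) with a space 0x20 (20).

I have modified the previous answer so that people can copy/paste the following code and pick their favorite method: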

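// The two leading non-breaking spaces are written as explicit UTF-8 bytes (C2 A0).
$some_text_with_non_breaking_spaces = "\xc2\xa0\xc2\xa0some text with 2 non breaking spaces at the beginning";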
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';

# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);

# Method 2 : convert to bin -> replace -> convert to hex
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));

# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);

echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';


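/\x{00A0}/, /\xC2\xA0/ and the hex2bin/str_replace/bin2hex approach both worked and did not work for me: if I print the result to the screen everything looks fine, but if I try to save it to a file, the file is blank.

I ended up using iconv('UTF-8', 'ISO-8859-1//IGNORE', $str).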

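Probably because $str is not a Unicode string.

Hi, newbie. Your answer worked for me, but I still don't understand why. Is it because my nbsp is not UTF-8? My data comes from a database table with the utf8_general_ci collation, so it should be UTF-8 (my character_set_client and character_set_connection are also UTF-8). Do you have a link with more information about this? Thanks.

It would be good to learn more about this issue, and also where the previous post was copied/pasted from.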
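One way a non-UTF-8 $str can lead to an empty result is that preg_replace() with the u flag returns NULL when the subject is not valid UTF-8. The sketch below illustrates that behavior only; the input is an assumed example (a lone ISO-8859-1 0xA0 byte), not the poster's actual data.

```php
<?php
// Assumed input: a lone 0xA0 byte (ISO-8859-1 non-breaking space), not valid UTF-8.
$latin1 = "abc\xa0def";

// preg_match() with an empty pattern and the u flag is a cheap UTF-8 validity check.
$isValidUtf8 = preg_match('//u', $latin1) === 1;

// With the u flag, an invalid subject makes preg_replace() return NULL.
$result = preg_replace('~\x{00a0}~u', ' ', $latin1);
$error  = preg_last_error();

var_dump($isValidUtf8);                    // false: the subject is not UTF-8
var_dump($result);                         // NULL, which would be written out as nothing
var_dump($error === PREG_BAD_UTF8_ERROR);  // true
```

Checking preg_last_error() after a NULL result is a quick way to tell an encoding problem apart from a pattern that simply did not match.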

The hex2bin() variant is dangerous: it will wrongly replace misaligned occurrences. For example, consider the hex sequence 0c2a0a.
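The misalignment can be reproduced with that exact sequence. A minimal sketch, with variable names chosen for illustration:

```php
<?php
// Bytes 0x0C 0x2A 0x0A: form feed, "*", line feed. There is no non-breaking space here.
$data = "\x0c\x2a\x0a";

$hex = bin2hex($data);                        // "0c2a0a"

// "c2a0" is found at an odd offset, straddling byte boundaries.
$replaced  = str_replace('c2a0', '20', $hex); // "020a"
$corrupted = hex2bin($replaced);              // bytes 0x02 0x0A: silently altered

// Replacing on the raw bytes cannot match across a byte boundary.
$safe = str_replace("\xc2\xa0", ' ', $data);

var_dump($corrupted === $data); // false: the hex round trip changed the data
var_dump($safe === $data);      // true: the byte-wise replace left it untouched
```

Because the hex string stays even in length after the replacement, hex2bin() raises no error, so the corruption goes unnoticed.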
