UTF-8 and PHP DOMDocument loadHTML?

Consider this example, test.php:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>

If I run this with PHP 5.5.9 (cli), I get the following in the terminal:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  NULL
  ["encoding"]=>
  NULL
  ["xmlEncoding"]=>
  NULL
...
  ["textContent"]=>
  string(70) "Hello, με à¤à¤¾à¤à¤  ça øy jeÅÄ"
}

Clearly, the original string is correct UTF-8, but the DOMDocument's textContent is incorrectly encoded.

So, how can I get correct UTF-8 content in the DOMDocument?

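DOMDocument::loadHTML() is built on top of libxml's HTML parser, which was written for HTML4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration saying otherwise, it will assume the content is ISO-8859-1.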

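Specifying the encoding when creating the DOMDocument, as you have, does not influence what the parser does: loading HTML (or XML) replaces both the XML version and the encoding you passed to the constructor.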
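To make the failure mode concrete, here is a minimal sketch that mimics the misreading outside of DOMDocument: it takes the UTF-8 bytes of one character, treats them as ISO-8859-1, and re-encodes them as UTF-8. It assumes the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// "é" in UTF-8 is the two bytes 0xC3 0xA9 (written as escapes so the
// example does not depend on how this file itself is saved).
$utf8 = "\xC3\xA9";

// Pretend those bytes are ISO-8859-1 and convert them to UTF-8.
// Each byte becomes its own character: 0xC3 -> "Ã", 0xA9 -> "©".
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";     // c3a9
echo bin2hex($mojibake), "\n"; // c383c2a9
echo $mojibake, "\n";          // Ã©
```

Every non-ASCII byte doubles in size this way, which would be consistent with the corrupted textContent above being longer (70 bytes) than the correct one (43 bytes).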


Workarounds: first, use mb_convert_encoding() to convert anything above the ASCII range into its HTML entity equivalent:

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
Or hack in a meta tag or XML declaration specifying UTF-8:

$domdoc->loadHTML('<meta http-equiv="Content-Type" content="charset=utf-8" />' . $mystr);
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);
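Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A possible variant of the entity approach, sketched here under that assumption, uses mb_encode_numericentity() instead. The helper name loadUtf8Html() is hypothetical, and the sketch assumes the dom and mbstring extensions are loaded.

```php
<?php
/**
 * Hypothetical helper: load a UTF-8 HTML fragment by first turning every
 * non-ASCII character into a numeric entity, so the parser only sees ASCII.
 */
function loadUtf8Html(string $html): DOMDocument
{
    // Map: code points 0x80..0x10FFFF, offset 0, mask keeping all bits.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

$text = "Hello, με काचं ça øy jeść";
$doc = loadUtf8Html("<p>$text</p>");

// The parser decodes the entities back into characters; PHP's DOM API
// returns strings as UTF-8.
echo $doc->documentElement->textContent, "\n";
```

Because the input reaching the parser is pure ASCII, the parser's default encoding no longer matters for the text content.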

Just wanted to post the OP code together with the fix that worked for me:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'UTF-8'); //DOMDocument();
$domdoc->substituteEntities = true; // no effect if hack is done
//~ $domdoc->actualEncoding = 'UTF-8'; // Cannot write property
$domdoc->encoding = 'UTF-8'; // no effect
//~ $domdoc->xmlEncoding = 'UTF-8'; // Cannot write property
//~ $domdoc->loadHTML($mystr); // already here corrupt UTF-8?
//~ $domdoc->loadHTML(utf8_decode($mystr)); // this gets to <p>Hello, ?? ?????  ça øy je??</p>, so not all
//~ $domdoc->loadHTML( mb_convert_encoding($mystr, 'utf-8', mb_detect_encoding($mystr)) ); // no dice
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr); // hack, http://php.net/manual/en/domdocument.loadhtml.php#95251
// dirty fix
foreach ($domdoc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $domdoc->removeChild($item); // remove hack
$domdoc->encoding = 'UTF-8'; // insert proper (sets all three)
var_dump($domdoc);
print $domdoc->saveXML(); // without ->encoding = 'UTF-8': Hello, &#x3BC;&#x3B5; &#xFEFF;&#x915;&#x93E;&#x91A;&#x902; else OK
//~ print mb_convert_encoding($domdoc->saveXML(), 'UTF-8', 'HTML-ENTITIES'); // if without ->encoding = 'UTF-8', this is then OK: <p>Hello, με काचं  ça øy jeść</p>
?>
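The same fix can be wrapped in a small function. This is only a sketch: the name loadHtmlWithXmlDeclaration() is hypothetical, and it assumes, as in the code above, that the parser exposes the prepended declaration as a processing-instruction node. It collects the nodes before removing them, since removing children while iterating over the live childNodes list can skip entries.

```php
<?php
/**
 * Hypothetical helper: load a UTF-8 HTML fragment using the prepended
 * XML declaration hack, then clean up after it.
 */
function loadHtmlWithXmlDeclaration(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // The declaration makes the HTML parser read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Collect first, remove afterwards: childNodes is a live list.
    $toRemove = [];
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $toRemove[] = $node;
        }
    }
    foreach ($toRemove as $node) {
        $doc->removeChild($node);
    }

    // Make saveXML() write the characters as UTF-8 instead of entities.
    $doc->encoding = 'UTF-8';
    return $doc;
}

$doc = loadHtmlWithXmlDeclaration('<p>ça øy jeść</p>');
echo $doc->documentElement->textContent, "\n";
```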

This produces:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  string(5) "UTF-8"
  ["encoding"]=>
  string(5) "UTF-8"
  ["xmlEncoding"]=>
  string(5) "UTF-8"
...
  ["textContent"]=>
  string(43) "Hello, με काचं  ça øy jeść"
}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello, με काचं  ça øy jeść</p></body></html>

... and now everything is fine :)

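I'm not sure that string is really UTF-8 if you just put the text in there like that.
Thanks @aleksv - any suggestions on what I should do to make the string UTF-8?
Maybe this can help.
Thanks, @aleksv - through that link I eventually found the hack that solved the problem...
Thanks @PaulCrovella - I managed to get it working with the prepended XML declaration hack; posting my solution as an answer... Cheers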