PHP DOM UTF-8 problem


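First of all, my database uses Windows-1250 as its native character set. I output the data as UTF-8. Throughout my website I use the iconv() function to convert Windows-1250 strings to UTF-8 strings, and that works perfectly.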
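To make the conversion step concrete, here is a minimal sketch of it. The sample bytes and variable names are illustrative assumptions, not data from the actual database. It converts two Windows-1250 bytes and uses preg_match() with the `u` modifier as a cheap check for whether a string is valid UTF-8.

```php
<?php
// Illustrative sample (an assumption): "čá" as raw Windows-1250 bytes.
$cp1250 = "\xE8\xE1";

// Convert to UTF-8 the same way the site does.
$utf8 = iconv('Windows-1250', 'UTF-8', $cp1250);

// With the "u" modifier, preg_match() returns false when the subject
// is not valid UTF-8, so it works as a quick validity test.
$before = preg_match('//u', $cp1250); // false: raw Windows-1250 bytes
$after  = preg_match('//u', $utf8);   // 1: valid UTF-8

echo bin2hex($utf8), "\n"; // c48dc3a1
```

Running the same check on the string just before it is passed to loadHTML() shows whether the input is already valid UTF-8 at that point.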

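The problem appears when I use PHP DOM to parse some HTML stored in the database. The HTML is the output of a WYSIWYG editor and is not a valid document: it has no html, head, or body tags, etc.

The HTML might look like this, for example:

<p>Hello</p>

Here is the method I use to parse the HTML from the database: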

 private function ParseSlideContent($slideContent)
 {
  var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }
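
The output of the above method is garbage: all special characters are replaced with strange characters such as ÚÄ�.

One more thing: it works on my development server, but it does not work on the production server.

Any suggestions?

PHP version on the production server: PHP Version 5.2.0RC4-dev

PHP version on the development server: PHP Version 5.2.13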


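UPDATE:

I am working on a solution myself. I got the inspiration from this PHP bug report (not really a bug, though):

Here is the solution I came up with. I will try it out tomorrow and let you know whether it works: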

 private function ParseSlideContent($slideContent)
 {
  var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }
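
Two solutions.

You can set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a meta tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: if both of these are set correctly, do the following:

  • Create a small page that contains UTF-8 characters
  • Write the page out using the same method you already have
  • Inspect the raw bytes transferred in both the development and production environments. You can also use Fiddler/Wireshark to double-check the headers

If you are sure the correct headers are being sent, your best chance of finding the error is to start looking at the raw bytes. The same bytes sent to the same b…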
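
The raw-byte check suggested above can also be done from PHP itself. The following is a minimal sketch. hexBytes() is a hypothetical helper name, and the sample character is an assumption chosen for illustration. It prints a string as hex bytes so the output of the two servers can be compared byte for byte.

```php
<?php
// Hypothetical helper: render a string as space-separated hex bytes.
function hexBytes($s)
{
    return trim(chunk_split(bin2hex($s), 2, ' '));
}

// "á" written with byte escapes, so this file's own encoding does not matter.
$utf8   = "\xC3\xA1";                             // á in UTF-8
$cp1250 = iconv('UTF-8', 'Windows-1250', $utf8);  // á in Windows-1250

echo hexBytes($utf8), "\n";   // c3 a1
echo hexBytes($cp1250), "\n"; // e1

// Converting a string that is already UTF-8 a second time produces more
// bytes than the original: a typical signature of double encoding.
$double = iconv('Windows-1250', 'UTF-8', $utf8);
echo hexBytes($double), "\n";
```

If the development server prints `c3 a1` for a character and the production server prints something else for the same stored value, the difference is in the conversion or parsing step rather than in the headers.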
<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument();
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8
string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
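
The meta-tag approach above could be wrapped into reusable helpers along the following lines. This is only a sketch under stated assumptions: loadUtf8Fragment() and bodyInnerHtml() are hypothetical names, and passing a node to saveHTML() requires PHP 5.3.6 or later, which is newer than both servers mentioned in the question.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment, forcing UTF-8
// interpretation by prepending a Content-Type meta tag.
function loadUtf8Fragment($utf8Html)
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . $utf8Html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Hypothetical helper: serialize only the children of <body>, which avoids
// splitting the full output on the '<body>' and '</body>' strings.
// Assumption: saveHTML($node) is available (PHP 5.3.6+).
function bodyInnerHtml(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// Usage: the input must already be UTF-8 (for example after iconv()).
$doc = loadUtf8Fragment("<p>\xC4\x84\xC4\xBD</p>"); // "ĄĽ" as UTF-8 bytes
echo bodyInnerHtml($doc), "\n";
```

On older PHP versions, the explode() approach from the question remains an option once the document has been loaded with the meta tag in place.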