PHP: Cutting HTML with DOMDocument outputs invalid characters

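I am using PHP's DOMDocument class to cut out a few lines of text. The text is a large chunk of HTML entered through a WYSIWYG editor.

The code I am using looks like this: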

$body_string .= '<p class="summary">';

$domd = new DOMDocument();
$domd->encoding = 'utf-8';
libxml_use_internal_errors(true);
$domd->loadHTML(utf8_decode($post['content']));
libxml_use_internal_errors(false);

$domx = new DOMXPath($domd);
$items = $domx->query("//p[position() = 1] | //div[position() = 1]");

$body_string .= substr($items->item(0)->textContent, 0, 230);
$body_string .= '</p>';

However, when the string contains special characters (such as an ellipsis or curly quotes), they turn into question marks.

Text like this:

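We know that TED talks can sometimes feel a little… overblown. There are a lot of great talks out there; some of them go nowhere and don't seem to add much to your life. To make matters worse… there are a lot of TED talks that are hard to tell

becomes this:

We know that TED talks can sometimes feel a little? overblown. There are a lot of great talks out there; some of them go nowhere and don't seem to add much to your life. To make matters worse? there are a lot of TED talks that are hard to tell

This only happens when I use the DOMDocument class. Without it, the characters are not converted to question marks.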

How can I fix this? The HTML document already has a … in its …

I can't seem to reproduce this, but try the following workaround:

$body_string .= '<p class="summary">';

$domd = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$domd->loadHTML(mb_convert_encoding($post['content'], 'HTML-ENTITIES', 'UTF-8'));
libxml_clear_errors();

$domx = new DOMXPath($domd);
$items = $domx->query("//p[position() = 1] | //div[position() = 1]");

$body_string .= substr($items->item(0)->textContent, 0, 230);
$body_string .= '</p>';
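
A note on why this workaround helps: once every non-ASCII character is written as an ASCII entity, the parser's guess about the input encoding no longer matters. As far as I know, the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so here is a minimal sketch of the same idea using mb_encode_numericentity(). It assumes the mbstring and dom extensions are loaded; the helper name to_numeric_entities and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: write every non-ASCII character as a numeric entity.
function to_numeric_entities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$previous = libxml_use_internal_errors(true); // tolerate sloppy markup
$doc = new DOMDocument();
// The input is now pure ASCII, so no encoding has to be guessed.
$doc->loadHTML(to_numeric_entities('<p>A little… “overblown”.</p>'));
libxml_clear_errors();
libxml_use_internal_errors($previous);

// textContent is returned as UTF-8 with the original characters restored.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The markup itself is untouched because tags and attributes are plain ASCII; only characters at or above U+0080 are rewritten.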

This is the closest I could get to something reproducible.

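Setting DOMDocument::encoding only changes the encoding used when the DOMDocument is printed as a string, so it has no effect here.

Similarly, passing 'utf-8' to the DOMDocument constructor has no effect either, because it is only used when a new document is created from scratch, not when an existing document is parsed.

The HTML parser needs to be told the encoding of the post content, like this: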

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML('<meta charset="utf-8">' . $post['content']);
libxml_use_internal_errors(false);
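
Putting the pieces together, here is a rough sketch of the whole summary step with the charset hint in place. It is only one way to arrange it: the function name first_block_summary and the sample input are hypothetical, and it assumes the dom and mbstring extensions are available. It also swaps substr() for mb_substr(), since substr() counts bytes and could cut a multi-byte UTF-8 character in half at the 230 mark.

```php
<?php
// Hypothetical helper; the name and the default limit are illustrative only.
function first_block_summary(string $html, int $limit = 230): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy WYSIWYG markup

    $doc = new DOMDocument();
    // The meta element tells the HTML parser that the bytes that follow are UTF-8.
    $doc->loadHTML('<meta charset="utf-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $xpath = new DOMXPath($doc);
    $items = $xpath->query('//p[position() = 1] | //div[position() = 1]');
    $first = $items->item(0);
    if ($first === null) {
        return ''; // no <p> or <div> found
    }

    // mb_substr counts characters, so a multi-byte character is never split.
    return mb_substr($first->textContent, 0, $limit, 'UTF-8');
}

echo first_block_summary('<p>A little… “overblown”.</p><p>Second paragraph.</p>'), "\n";
```

The result is plain text, so it should still go through htmlspecialchars() before being placed inside the summary paragraph.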

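Can you provide an example string that reproduces your problem?

@Ghost I'm not sure what you mean... Did you see the text above?

I was trying to reproduce this in my environment; I meant the contents of $post['content']. But anyway, the only thing I can do here is guess from the quoted text. Is that what is in $post['content']?

@maxxon15 I'm glad this helped.

Can you explain what I was doing wrong? I still don't understand what I did wrong… :)

@maxxon15 Basically, because ->loadHTML() defaults to ISO-8859-1, it messes up those characters. So the workaround is to convert them to HTML entities before loading them.

@maxxon15 There is a very eloquent answer on this site about that part. I think it was hakre's answer, but I can't find that post. Sorry if I can't explain it well, English is not my first language.

@maxxon15 Yes, basically the conversion is done when feeding it into ->loadHTML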
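
For what it's worth, the question marks in the question most likely come from utf8_decode() rather than from the parser itself: it converts UTF-8 to ISO-8859-1, and that character set has no ellipsis or curly quotes, so each of them is replaced with '?'. A minimal sketch, assuming the mbstring extension is loaded and its substitute character is left at the default; mb_convert_encoding() stands in for utf8_decode() here because the latter is deprecated as of PHP 8.2, and the sample string is made up:

```php
<?php
$utf8 = 'a little… “overblown”';

// The same kind of conversion utf8_decode() performs: UTF-8 to ISO-8859-1.
// Characters with no ISO-8859-1 equivalent are replaced with '?'.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo $latin1, "\n"; // prints: a little? ?overblown?
```

Once the characters have been replaced this way, nothing done later in DOMDocument can bring them back, which is why the fix has to happen before or during loadHTML().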