PHP: converting spaces between pre tags with a DOM parser


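Regex was my initial idea for a solution, although it quickly became apparent that a DOM parser would be more appropriate... I want to convert the spaces between pre tags in an HTML text string to &nbsp;. For example: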

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc 123
<span class="abc">abc 123</span>
</pre>
<pre>123 123</pre>
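
should become the following (note that the spaces inside the span tag's attributes are preserved):

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc&nbsp;123
<span class="abc">abc&nbsp;123</span>
</pre>
<pre>123&nbsp;123</pre>

The result needs to be serialized back into a string so that it can be used elsewhere.

Inserting &nbsp; entities without DOM converting the ampersand into &amp; takes some care, because entities are nodes, whereas spaces are just character data. Here is how to do it:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// Select every text node that has a pre element among its ancestors.
foreach ($xp->query('//text()[ancestor::pre]') as $textNode)
{
    $remaining = $textNode;
    // splitText() counts characters, not bytes, so use the mb_* functions
    // (mbstring extension) to keep offsets correct for non-ASCII text.
    while (($nextSpace = mb_strpos($remaining->nodeValue, ' ', 0, 'UTF-8')) !== false) {
        // Split at the space; $remaining now starts with that space.
        $remaining = $remaining->splitText($nextSpace);
        // Drop the space and insert an nbsp entity node in its place.
        $remaining->nodeValue = mb_substr($remaining->nodeValue, 1, null, 'UTF-8');
        $remaining->parentNode->insertBefore(
            $dom->createEntityReference('nbsp'),
            $remaining
        );
    }
}

Each matching text node is split at every space, the space is removed, and an nbsp entity node is inserted in its place, so you end up with a tree like this:

DOMElement pre
    DOMText "abc"
    DOMEntity nbsp
    DOMText "123"
    DOMElement span
       DOMText "abc"
       DOMEntity nbsp
       DOMText "123"
DOMElement pre
    DOMText "123"
    DOMEntity nbsp
    DOMText "123"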
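
To illustrate why an entity node is needed rather than a literal &nbsp; string, here is a minimal sketch. It assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument; the markup and the ids "text" and "entity" are made up for the example.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<body><pre id="text"></pre><pre id="entity"></pre></body>');
$pres = $dom->getElementsByTagName('pre');

// Variant 1: "&nbsp;" written as character data; the ampersand is escaped on output.
$textPre = $pres->item(0);
$textPre->appendChild($dom->createTextNode('abc&nbsp;123'));

// Variant 2: the entity inserted as its own node between two text nodes.
$entityPre = $pres->item(1);
$entityPre->appendChild($dom->createTextNode('abc'));
$entityPre->appendChild($dom->createEntityReference('nbsp'));
$entityPre->appendChild($dom->createTextNode('123'));

$textHtml = $dom->saveHTML($textPre);
$entityHtml = $dom->saveHTML($entityPre);

echo $textHtml, "\n";   // contains abc&amp;nbsp;123
echo $entityHtml, "\n"; // contains abc&nbsp;123
```

The second variant is the one that matches the tree shown above: text, entity node, text.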
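Fetching all the pre elements and working with their nodeValue does not work here, because the nodeValue property contains the combined DOMText values of all children; for example, it includes the nodeValue of the span child. Setting nodeValue on the pre element would delete those child elements.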
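
The behaviour described here can be checked with a small sketch; the sample markup is an assumption made up for the demonstration.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123 <span class="abc">abc 123</span></pre>');
$pre = $dom->getElementsByTagName('pre')->item(0);

// Reading nodeValue returns the text of the pre element and of all its descendants.
$combined = $pre->nodeValue;
$spansBefore = $pre->getElementsByTagName('span')->length;

// Writing nodeValue replaces every child node with a single text node.
$pre->nodeValue = str_replace(' ', '_', $combined);
$spansAfter = $pre->getElementsByTagName('span')->length;

echo $combined, "\n";    // abc 123 abc 123
echo $spansBefore, "\n"; // 1
echo $spansAfter, "\n";  // 0
```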

So instead of fetching the pre nodes, we fetch all DOMText nodes that have a pre element as an ancestor somewhere on their axis, which is what the XPath query //text()[ancestor::pre] selects.

Because we only work with the DOMText nodes, every DOMElement is left untouched, so the span elements inside the pre element are preserved.

Caveats:

Your snippet is not a valid document because it has no root element. When you use loadHTML, libxml adds any missing structure to the DOM, which means you get your snippet back wrapped in a DOCTYPE, html and body tags.

If you want the original snippet back, you have to getElementsByTagName the body node and collect all of its children to get the innerHTML. Unfortunately, we have to do this manually:

$innerHtml = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    // Import each child of body into a scratch document and serialize it on its own.
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child,true));
    $innerHtml .= $tmp_doc->saveHTML();
}
echo $innerHtml;
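
On PHP 5.3.6 and later, saveHTML() accepts a node argument, so the temporary document may not be needed. The following is a minimal sketch under that assumption; the helper name innerHtml and the sample markup are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: concatenates the serialized children of a node.
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Serialize each child through the document that owns it.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument;
$dom->loadHTML('<p>one</p><pre>a b</pre>');
$body = $dom->getElementsByTagName('body')->item(0);

$snippet = innerHtml($body);
echo $snippet, "\n"; // the p and pre elements, without DOCTYPE, html or body
```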


I see the shortcomings of my previous answer: it could strip the tags inside the <pre> tags. Here is a workaround that preserves the tags inside <pre>:
<?php
$test = file_get_contents('input.html');
$dom = new DOMDocument('1.0');
$dom->loadHTML($test);
$xpath = new DOMXpath($dom);
$pre = $xpath->query('//pre//text()');
// manipulate nodes of type XML_TEXT_NODE
foreach($pre as $e) {
    $e->nodeValue = str_replace(' ', '__REPLACEMELATER__', $e->nodeValue);
    // when you attempt to write &nbsp; in a dom node
    // the & will be converted to &amp; :(
}
$temp = $dom->saveHTML();
$temp = str_replace('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '', $temp);
$temp = str_replace('<html>', '', $temp);
$temp = str_replace('<body>', '', $temp);
$temp = str_replace('</body>', '', $temp);
$temp = str_replace('</html>', '', $temp);
$temp = str_replace('__REPLACEMELATER__', '&nbsp;', $temp);
echo $temp;
?>
Input (input.html):
<p>paragraph 1 remains untouched</p>
<pre>preformatted 1</pre>
<div>
    <pre>preformatted 2</pre>
</div>
<div>
    <pre>preformatted 3 <span class="foo">span text</span> preformatted 3</pre>
</div>
<div>
    <pre>preformatted 4 <span class="foo">span <b class="bla">bold test</b> text</span> preformatted 3</pre>
</div>
Output:
<p>paragraph 1 remains untouched</p>
<pre>preformatted&nbsp;1</pre>
<div>
    <pre>preformatted&nbsp;2</pre>
</div>
<div>
    <pre>preformatted&nbsp;3&nbsp;<span class="foo">span&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
<div>
    <pre>preformatted&nbsp;4&nbsp;<span class="foo">span&nbsp;<b class="bla">bold&nbsp;test</b>&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
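Alternatively, the loop body can write a real non-breaking space instead of a placeholder:

$e->nodeValue = utf8_encode(str_replace(' ', "\xA0", $e->nodeValue));
// on output, the DOM library will attempt to convert the non-breaking space to &nbsp;
// nodeValue expects UTF-8 encoded data, and a lone 0xA0 byte is not valid UTF-8,
// hence the replaced string must be UTF-8 encoded
// caveat: utf8_encode() treats its input as ISO-8859-1, so existing non-ASCII text
// would be double-encoded; the function is also deprecated as of PHP 8.2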
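
The same idea can be sketched without utf8_encode() by writing the UTF-8 form of the non-breaking space directly. This assumes PHP 7 or later for the "\u{00A0}" escape, and the sample markup is made up. Whether saveHTML() then emits &nbsp; or the raw character may depend on the libxml version and the document encoding, so the sketch only inspects the DOM.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>preformatted 3 <span class="foo">span text</span> end</pre>');
$xpath = new DOMXPath($dom);

// Replace each space in the text nodes below pre with U+00A0 (bytes C2 A0 in UTF-8).
foreach ($xpath->query('//pre//text()') as $textNode) {
    $textNode->nodeValue = str_replace(' ', "\u{00A0}", $textNode->nodeValue);
}

$pre = $dom->getElementsByTagName('pre')->item(0);
$spanCount = $pre->getElementsByTagName('span')->length; // child elements are kept
$text = $pre->textContent;

echo $dom->saveHTML($pre), "\n";
```

Because only text nodes are modified, the span element survives, and no placeholder string or post-processing of the serialized output is involved.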