PHP: How to clean up fragments of HTML source?

php regex algorithm typo3

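As a search result I get the content surrounding the search term. But this is only a sub-part of the whole page, and it includes only the tags near the search term. If the match boundaries (start/end) are far apart, I end up with unbalanced HTML tags. These unbalanced tags can break the page layout, because the browser tries to balance them using tags from other nesting levels.

Example

This might be the whole page: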

<li>
  <h3>Ang my oniuse.</h3> 
  <p>Oh! any or said faing ear Dand and tion on so wor st wouter and abox 
  a makess stand he he sne at mon the nany ing a me come hink floney a 
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat seelectler</h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
  thend that ance, he ned and me lood says wou hed set pidays far it
  conted, and seell yarty.</p>
</li>

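In a search result cut out of this page, the p and li tags are unbalanced. Given the closing tags in the excerpt, the browser tries to close the p tag (which may surround the whole found text) and the li tag (which may surround each found entry).
The next opening tags of those elements then carry the wrong CSS class, some div tags between li and p no longer match, and the final closing tags may close div tags that belong to the column layout.

Result: the whole page layout is destroyed.

The desired result could be (all unpaired tags are paired up; this is not foolproof):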


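<li><p>
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
</p></li>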

or:

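  naiday. Smiler yousee lurneremiley boll his a grog.
  <h3>I'l hat <b>seelectler</b></h3> 
  Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his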

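But this second solution might lose important layout, such as line breaks.

Is there a ViewHelper that can clean up unbalanced HTML tags, either by adding the missing parts or by removing the leftover parts?

Is there an algorithm or regexp for detecting unbalanced tags?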
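
On the detection part of the question: a single regular expression cannot track nesting reliably, but a regex tokenizer combined with a stack is a common approach. The following is only a minimal sketch under that assumption. The function name `findUnbalancedTags` and the shape of its return value are hypothetical, and it ignores comments, CDATA, script content, and HTML's optional end tags.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: report tags in an HTML fragment that have no partner.
 * Returns ['unopened' => [...], 'unclosed' => [...]] with lowercase tag names.
 */
function findUnbalancedTags(string $html): array
{
    // Void elements never take a closing tag, so they are skipped.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];     // names of the tags that are currently open
    $unopened = [];  // closing tags that had no matching opener
    $unclosed = [];  // openers that were skipped over by an outer closing tag

    // Tokenize: group 1 = "/" for closing tags, 2 = tag name, 3 = "/" for <x/>.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>~i', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as [, $closing, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue;
        }
        if ($closing === '') {
            $stack[] = $name;
            continue;
        }
        // Closing tag: find the nearest matching opener, searching from the top.
        $pos = array_search($name, array_reverse($stack, true), true);
        if ($pos === false) {
            $unopened[] = $name;
            continue;
        }
        // Remove the opener and everything opened after it; the first removed
        // entry is the match, the remaining ones were never closed.
        $removed = array_splice($stack, $pos);
        $unclosed = array_merge($unclosed, array_slice($removed, 1));
    }

    // Whatever is still on the stack at the end was never closed either.
    return ['unopened' => $unopened, 'unclosed' => array_merge($unclosed, $stack)];
}
```

For an excerpt that begins with `</p></li>` and ends inside a `<li>` and `<p>`, this reports `p` and `li` as unopened and `li` and `p` as unclosed, which is the information needed to decide which tags to add or drop.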

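I suggest stripping all HTML tags from the search result and displaying the result as plain text.

You could recreate some minor "formatting" by replacing certain tags with line breaks.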
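
The tag-to-line-break idea can be sketched in plain PHP as follows. This is an illustration only: the function name `htmlFragmentToText` and the list of tags treated as block-level are assumptions, not part of the original suggestion.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn an HTML fragment into plain text,
 * keeping a line break wherever a block-level element ended.
 */
function htmlFragmentToText(string $html): string
{
    // Replace <br> and closing block-level tags with a newline
    // (the tag list is an assumption; extend it as needed).
    $text = preg_replace('~<br\s*/?>|</(?:p|li|h[1-6]|div|tr)\s*>~i', "\n", $html) ?? $html;

    // Drop every remaining tag, whether it is balanced or not.
    $text = strip_tags($text);

    // Decode entities such as &auml; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Trim each line and discard the empty ones.
    $lines = array_filter(array_map('trim', explode("\n", $text)), 'strlen');

    return implode("\n", $lines);
}
```

Because every tag is removed, unbalanced tags can no longer affect the surrounding layout. The output still has to be escaped and its newlines converted (for example with `nl2br()`) when it is rendered into the page.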

The closest solution I found is to use this ViewHelper:

<?php
namespace MyCompany\MyExtension\ViewHelpers;

use TYPO3\CMS\Fluid\Core\ViewHelper\AbstractViewHelper;

/**
 * fills in missing xml tags
 */
class BalanceXmlViewHelper extends AbstractViewHelper
{

    /**
     * balances XML-fragment with additional tags
     *
     * @param string $xmlIn
     * @return string
     */
    public function render($xmlIn = null)
    {
        if (null === $xmlIn) {
            $xmlIn = $this->renderChildren();
        }

        $xmlDoc = new \DOMDocument();
        // it's UTF-8 data!
        $xmlDoc->loadHTML('<?xml encoding="UTF-8">' . $xmlIn
              // we want no complete HTML-document, so neglect some default-tags
            , LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOXMLDECL
        );

        // remove the additional charset tag and replace german umlauts
        $retVal = html_entity_decode(mb_substr($xmlDoc->saveHTML(),23)
                                    ,ENT_COMPAT | ENT_HTML401
                                    );


        return $retVal;
    }
}

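I know it can leave invalid markup behind (for example, LI tags without a UL), but it is more accurate than removing all tags (stripHTML()), which produces text without line breaks or even whitespace once the block tags are gone.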
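
For experimenting with the same DOMDocument idea outside of TYPO3, here is a standalone sketch. It is a variant, not the ViewHelper above: it wraps the fragment in a container `<div>` and serializes that container's children, so it does not depend on `LIBXML_HTML_NOIMPLIED` or on the fixed 23-character offset. The function name `balanceHtmlFragment` is hypothetical, and a stray `</div>` inside the fragment would close the wrapper early.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: let libxml's HTML parser repair a fragment and
 * return the repaired markup without html/body wrappers.
 */
function balanceHtmlFragment(string $fragment): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about the broken markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // The XML declaration is the usual trick to make loadHTML() read UTF-8.
    // The wrapper <div> gives the fragment a single known parent element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $fragment . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    if ($wrapper === null) {
        return '';
    }

    // Serialize only the wrapper's children, i.e. the repaired fragment.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}
```

The serializer writes a closing tag for every non-void element it opened, so the returned markup is balanced. As with the ViewHelper, it may still be invalid in context, such as an LI without a UL.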

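Comments:

Please elaborate on what exactly you need. Add the expected result.

I think the problem is with the HTML parser in the indexed search extension. It returns results that still contain HTML tags, so you have to format the results manually. Check this answer. Hope this helps!

It is not indexed search; the content is indexed by Solr.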
