PHP: How do I remove attributes from an HTML tag?

How can I use PHP to strip all/any attributes from a tag (a paragraph tag, for example)?


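[...] is one of the better tools for cleaning up HTML with PHP.
Although there are better ways, you can actually strip arguments from HTML tags with a regular expression: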
<?php
function stripArgumentFromTags( $htmlString ) {
    $regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag

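    // Split on start tags. The captured delimiters are the tag name (with any
    // text before it) and the tag end (with any text after it); the attributes
    // sit in a non-capturing group and are therefore dropped.
    $chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Rejoin every chunk, including chunk 0 (anything before the first start
    // tag), so that input without a leading start tag is not lost.
    return implode('', $chunks);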
}
?>



The above could probably be written in fewer characters, but it does the job (quick and dirty).
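As one possible "better way", here is a minimal sketch that removes attributes with PHP's bundled DOM extension instead of a regex. The function name `stripAttributesWithDom`, its `$allowed` parameter and the wrapper `div` are assumptions made for illustration and are not part of the original answer. The `<?xml encoding="UTF-8">` prefix is a common workaround to make `loadHTML()` read the input as UTF-8.

```php
<?php
// Hypothetical helper: strip every attribute except those listed in $allowed.
function stripAttributesWithDom(string $html, array $allowed = []): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly

    // Wrap the fragment in a <div> so the parser does not restructure loose text
    // and so the fragment can be extracted again afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $wrapper = $body ? $body->firstChild : null;
    if (!$wrapper instanceof DOMElement) {
        return ''; // nothing parseable
    }

    foreach ($wrapper->getElementsByTagName('*') as $element) {
        // Collect the names first: removing while iterating the map is unsafe.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $allowed, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $result = '';
    foreach ($wrapper->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

// Usage: keep href, drop everything else.
echo stripAttributesWithDom(
    '<p class="one" id="x">Hi <a href="/a" onclick="evil()">link</a></p>',
    ['href']
);
```

Because the parser repairs the markup as it loads it, the output may differ from the input in more than just the attributes (for example, unclosed tags get closed).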
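You could also take a look at HTML Purifier. Admittedly it is quite bloated, and it may not fit your needs if you only care about this particular example, but it offers more or less "bulletproof" sanitization of possibly hostile HTML. You can also choose to allow or disallow certain attributes (it is highly configurable).
Strip attributes using SimpleXML (standard in PHP 5)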
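<?php
// NOTE: the opening lines of this snippet are missing from the source. The setup
// down to simplexml_load_string() is an assumed reconstruction with example values.
$allowed_tags = '<p><a><img>';         // assumed example: tags kept by strip_tags()
$allowed_atts = array('href', 'src');  // assumed example: attributes to keep
$strip_arr = array();                  // regex patterns for the attributes to strip

// Wrap the input fragment ($data_str) in a root element so it parses as one document.
$data_sxml = simplexml_load_string('<root>' . $data_str . '</root>');

if ($data_sxml) {
    // loop over all elements that have at least one attribute
    foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
        // loop over the attributes
        foreach ($tag->attributes() as $name => $value) {
            // check for allowed attributes
            if (!in_array($name, $allowed_atts)) {
                // set the attribute value to an empty string
                $tag->attributes()->$name = '';
                // collect the attribute pattern to strip
                $strip_arr[$name] = '/ ' . $name . '=""/';
            }
        }
    }
    // strip the disallowed attributes and the root tag
    $data_str = strip_tags(preg_replace($strip_arr, '', $data_sxml->asXML()), $allowed_tags);
}
?>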
Here is a function that lets you strip all attributes except the ones you want to keep:
function stripAttributes($s, $allowedattr = array()) {
  if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
   foreach ($res as $r) {
     $tag = $r[0];
     $attrs = array();
     preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
     foreach ($split as $spl) {
      $attrs[] = $spl[0];
     }
     $newattrs = array();
     foreach ($attrs as $a) {
      $tmp = explode("=", $a);
      if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {

      } else {
          $newattrs[] = $a;
      }
     }
     $attrs = implode(" ", $newattrs);
     $rpl = str_replace($r[1], $attrs, $tag);
     $s = str_replace($tag, $rpl, $s);
   }
  }
  return $s;
}

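Call it with only the HTML string to strip every attribute, or pass an array of attribute names to keep as the second argument.
You can echo $message to preview the result.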
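To be honest, I think the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here: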
<html><body>

<?php

require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);

$purifier = new HTMLPurifier($config);

$dirty_html = "
  <a href=\"http://www.google.de\">broken a href link</a
  fnord

  <x>y</z>
  <b>c</p>
  <script>alert(\"foo!\");</script>

  <a href=\"javascript:alert(history.length)\">Anzahl besuchter Seiten</a>
  <img src=\"www.example.com/bla.gif\" />
  <a href=\"http://www.google.de\">missing end tag
 ende 
";

$clean_html = $purifier->purify($dirty_html);

print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";

print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";

?>

</body></html>

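This works very well, but only if your input HTML is well-formed XML. Otherwise you have to do some pre-cleaning of the input HTML before parsing it. If you do not have full control over the source HTML, this can get very tedious.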
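To illustrate the well-formedness caveat, here is a small sketch that checks whether a fragment parses as XML before it is handed to the SimpleXML approach. The helper name `isWellFormedFragment` and the `<root>` wrapper are assumptions made for this example.

```php
<?php
// Hypothetical helper: true if the fragment parses as well-formed XML.
function isWellFormedFragment(string $fragment): bool
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings

    // XML allows only one top-level element, so wrap the fragment in a root.
    $xml = simplexml_load_string('<root>' . $fragment . '</root>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml !== false;
}

var_dump(isWellFormedFragment('<p class="one">ok</p>'));       // bool(true)
var_dump(isWellFormedFragment('<p class=one>broken<br></p>')); // bool(false): unquoted attribute, unclosed <br>
```

When the check fails, the input has to be cleaned first, or passed to a parser that tolerates real-world HTML.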
echo stripAttributes('<p class="one" otherrandomattribute="two">');

echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));

$message = stripAttributes($_POST['message']);

<a href="http://www.google.de">broken a href link</a>fnord

y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende 
</a></b>

$config->set('HTML.Allowed', 'p');