Allow users to submit HTML in PHP

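I want to allow a lot of user-submitted HTML in user profiles. At the moment I try to filter out what I don't want, but I now want to switch to a whitelist approach.

Here is my current, non-whitelist method: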

function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
        $oldstring = $string;
        //bgsound|
        $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

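The code above works well and I have never had a problem with it in 2 years of use, but for a whitelist approach, is there anything similar to Stack Overflow's C# method, only in PHP?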

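It may be safer to parse the HTML into a DOM tree, remove the disallowed tags with removeChild(), and then take the result. Filtering with regular expressions is not always safe, especially once things get this complex. Attackers can find a way to trick your filter; forums and social networks know this all too well.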
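One way the DOM-based idea above might look, as a rough sketch rather than a vetted sanitizer: parse the fragment with DOMDocument, walk the tree, and call removeChild() on every node that is not whitelisted. The function names `removeDisallowed` and `sanitizeWithDom`, the tag and attribute lists, and the rule that only http, https and mailto links are kept are assumptions made for illustration. Disallowed elements are dropped together with their contents.

```php
<?php
// Hypothetical helpers; only DOMDocument and the libxml_* functions are PHP built-ins.

function removeDisallowed(DOMNode $node, array $allowedTags, array $allowedAttrs): void
{
    // Copy the child list first: removing nodes while iterating the live
    // childNodes list would skip siblings.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept; saveHTML() escapes it on output
        }
        if (!($child instanceof DOMElement)
            || !in_array(strtolower($child->tagName), $allowedTags, true)) {
            $node->removeChild($child); // drops the node and everything inside it
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name = strtolower($attr->name);
            // Assumption: only absolute http(s) and mailto links are acceptable,
            // so relative links are rejected in this sketch.
            $isSafeUrl = $name !== 'href'
                || preg_match('~^(https?:|mailto:)~i', trim($attr->value)) === 1;
            if (!in_array($name, $allowedAttrs, true) || !$isSafeUrl) {
                $child->removeAttribute($attr->name);
            }
        }
        removeDisallowed($child, $allowedTags, $allowedAttrs);
    }
}

function sanitizeWithDom(string $html, array $allowedTags, array $allowedAttrs): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The XML declaration hints that the input is UTF-8; the wrapper div
    // gives the fragment a single known root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    removeDisallowed($root, $allowedTags, $allowedAttrs);

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = '<b onclick="x()">bold</b><script>alert(1)</script><a href="javascript:alert(1)">link</a>';
$clean = sanitizeWithDom($dirty, array('b', 'i', 'u', 'strong', 'em', 'a'), array('href'));
echo $clean, "\n";
```

Because the parser, not a pattern, decides what counts as a tag, oddly spaced or malformed markup is handled the same way a browser-like parser would see it.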

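As an example of such a trick, browsers can be lenient about whitespace that your regex does not anticipate.

This is actually a fairly simple goal to accomplish: you just need to check for anything that is not one of the tags on your whitelist and remove it from the source. This can be done quite easily with a single regular expression.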

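function sanitize($html) {
  $whitelist = array(
    'b', 'i', 'u', 'strong', 'em', 'a'
  );

  // Remove every opening or closing tag whose name is not on the whitelist.
  // A negative lookahead does the negation; "(^...)" would anchor, not negate.
  // Attributes on whitelisted tags and tags that never close are left untouched.
  $names = implode('|', $whitelist);
  return preg_replace('~</?(?!(?:' . $names . ')(?=[\s/>]))[a-z][^>]*>~i', '', $html);
}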

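I haven't tested this and there may be a bug in it, but you get the gist of how it works. You may also want to look at using a formatting language such as Textile or Markdown.

Jamie

HTML Purifier is the best HTML parser/sanitizer.

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.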
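For reference, a call into HTML Purifier might look roughly like the sketch below. It assumes the library is already installed and autoloadable (for example through Composer), which the answer does not cover; the wrapper name `purifyProfileHtml` and the particular whitelist are made up for illustration.

```php
<?php
// Assumes HTML Purifier is installed and its classes can be autoloaded;
// that setup is not part of the original answer.

function purifyProfileHtml(string $dirtyHtml): string // hypothetical wrapper name
{
    $config = HTMLPurifier_Config::createDefault();
    // Whitelist of elements and, in brackets, the attributes permitted on them.
    $config->set('HTML.Allowed', 'b,i,u,strong,em,a[href]');
    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}
```

In a real application the purifier object would normally be built once and reused, since constructing it for every call is comparatively expensive.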

You can just use the strip_tags() function.

Since the function is defined as

string strip_tags  ( string $str  [, string $allowable_tags  ] )

you can do this:

$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');

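Try the "getCleanHTML" function below. It extracts the text content of elements, except for elements whose tag name is in the whitelist, which are kept. The code is clean and easy to understand and debug.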

<?php

$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

// Serialize a node, with its attributes and children, back to HTML
function getHTMLCode($Node) {
    $Document = new DOMDocument();    
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;

    // Text nodes: re-escape the decoded text so entities cannot turn into markup
    if ($Node instanceof DOMText)
        return $Text.htmlspecialchars($Node->textContent);
    // Comments and other non-element nodes are dropped
    if (!($Node instanceof DOMElement))
        return $Text;

    // Whitelisted elements are kept whole, including their attributes and descendants
    if (in_array($Node->tagName, $TagWhiteList)) 
        return $Text.getHTMLCode($Node);

    // Any other element: keep only what its children yield
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getCleanHTML($Child, $Text);
    return $Text;
}

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";

?>

Hope this helps.

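For those suggesting just using strip_tags, be aware: strip_tags will NOT strip out tag attributes, and broken tags will also trip it up.

From the manual page:

Warning: Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.

Warning: This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

You can't rely on this one solution alone.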
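The second warning is easy to see in a small experiment. The input string below is invented for the demonstration; strip_tags() is called exactly as in the earlier answer.

```php
<?php
// Made-up input: an allowed tag carrying an event handler, plus a script block.
$input = '<b onmouseover="alert(1)">hi</b><script>steal()</script>';

$output = strip_tags($input, '<b>');

// The <script> tags are removed (their inner text stays behind), but the
// onmouseover handler on the allowed <b> tag passes through unchanged.
echo $output, "\n";
```

So even with a tight list of allowed tags, the attributes on those tags still need to be cleaned by some other means.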

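Use it with PHP; it is the one true path. Its output is amazing and safe.

I looked at this before and thought it was very bulky, but I'll check it out again.

Thanks, this is just what I needed. I searched for about half an hour before I saw your post! :-)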
