C#: How do I filter out all HTML tags except those on a whitelist?

c# html vb.net regex

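This is for .NET. IgnoreCase is set, but Multiline is not.

Usually I'm decent at regex; maybe I'm just running low on caffeine.

Users are allowed to enter HTML-encoded entities ([...]). As noted below, ">" is legal inside attribute values, but it's safe to say I won't support that. Also, there won't be any CDATA blocks or the like to worry about. Just a little HTML.

Loophole's answer is the best so far, thanks! Here is his pattern (hoping the PRE works better for me):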

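static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Attributes are the main problem with using regular expressions on HTML. Consider the sheer number of possible attributes, the fact that most of them are optional, the fact that they can appear in any order, and the fact that ">" is a legal character inside quoted attribute values. Once you try to account for all of that, the regex you need quickly becomes unmanageable.
What I would do instead is use an event-based HTML parser, or one that gives you a DOM tree you can walk.

Here is a good example of HTML tag filtering:

The reason adding the word boundary \b didn't work is that you didn't put it inside the lookahead. As a result, \b is attempted right after <, where it always matches if the < starts an HTML tag.
Put it inside the lookahead, like this: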
<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>
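
To make the effect of the \b position concrete, here is a minimal sketch that runs the pattern above next to a variant without \b. The names LookaheadDemo, StripDisallowed, WithBoundary and WithoutBoundary are made up for illustration, and RegexOptions.IgnoreCase is assumed because the question says that option is set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // \b sits inside the negative lookahead, so an allowed "b" does not shield "blink".
    public const string WithBoundary = @"<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>";

    // Same pattern without \b, shown only for comparison (hypothetical variant).
    public const string WithoutBoundary = @"<(?!/?(i|b|h3|h4|a|img))[^>]+>";

    // Removes every tag the given pattern matches, i.e. every tag not on the whitelist.
    public static string StripDisallowed(string html, string pattern)
    {
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "<b>bold</b> <blink>hi</blink> <img src=\"x.png\"/> <p>text</p>";

        // prints: <b>bold</b> hi <img src="x.png"/> text
        Console.WriteLine(StripDisallowed(input, WithBoundary));

        // prints: <b>bold</b> <blink>hi</blink> <img src="x.png"/> text
        Console.WriteLine(StripDisallowed(input, WithoutBoundary));
    }
}
```

Without \b, any tag whose name merely starts with a whitelisted name survives, which is why blink gets through in the second call.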

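This also shows how you can put the / before the list of tags, rather than in front of each tag.

I think I originally meant to make the values optional but never followed through, since I can see that I added a ? after the equals sign and grouped the value portion of the match. Let's add a ? after that group (marked with a caret) to make it optional in the match as well. I'm not at my compiler right now, but see if this works: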
@"</?(?(?=" + acceptable + @")notag|[a-z,A-Z,0-9]+)(?:\s[a-z,A-Z,0-9,\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
                                                                                             ^

Here's a function I wrote for this task:
static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

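I noticed that the current solution allows any tag that starts with one of the acceptable tags. So if "b" is an acceptable tag, "blink" is too. Not a huge deal, but something to consider if you are strict about how you filter HTML. You certainly wouldn't want "s" as an acceptable tag, since it would allow "script".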
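
One possible way to close the prefix gap described above, assuming the conditional pattern from the accepted answer is kept, is to wrap the whitelist in a non-capturing group and follow it with \b inside the lookahead condition. This is only a sketch: PrefixSafeFilter and RemoveUnlisted are hypothetical names, the whitelist is passed in as a parameter, the commas are dropped from the character classes, and rejected tags are replaced with an empty string rather than "sausage".

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixSafeFilter
{
    // Removes every tag whose name is not exactly one of the names in `acceptable`
    // (a |-separated list such as "b|i").
    public static string RemoveUnlisted(string html, string acceptable)
    {
        // (?:...)\b makes the condition true only when a whole whitelisted name follows,
        // so "b" no longer admits "blink". When the condition is true, the literal
        // "notag" cannot match and the tag is left untouched.
        string pattern = @"</?(?(?=(?:" + acceptable + @")\b)notag|[a-zA-Z0-9]+)"
                       + @"(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // prints: <b>bold</b> hi
        Console.WriteLine(RemoveUnlisted("<b>bold</b> <blink>hi</blink>", "b|i"));
    }
}
```

Note that \b also counts "-" and ":" as boundaries, so a name such as b-foo would still pass with "b" on the whitelist. A stricter condition that requires whitespace, "/" or ">" after the name would avoid that, and the general caveats about sanitizing HTML with regular expressions raised elsewhere in this thread still apply.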
The HtmlSanitizer used below is built on top of the HTML Agility Pack and has a simple syntax for sanitizing markup.
The method HtmlSanitizer.SimpleHtml5Sanitizer() produces a sanitizer with everything I needed, but here is a more dynamic approach:
public static string GetLimitedHtml(string value)
{
    var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();
    var allowed = new string[] {"br", "h1", "h2", "h3", "h4", "h5", "h6", "small", "strike", "strong", "b"};
    foreach (var tag in allowed)
    {
        sanitizer.Tag(tag);
    }
    
    return sanitizer.Sanitize(value);
}

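lol… there's still a comma in the last character range. Thanks for the update! I adjusted the code in the OP.

Thanks for the code! Is this code updated, or should the commas be removed from the expression?

Just to add a note: my HTML input comes from an external source, and it has an invalid br tag "

Works very well. I tweaked the regex a little to include smart tags (mostly Office formatting). string stringPattern = @"

Tedd, this answer is 8 years old. If you have a better approach, feel free to post your own answer.

Please remove the unnecessary [regular] tag.

Did you have any luck removing attributes? Loophole's answer doesn't seem to do that?

The refactoring code site has been down for a while. I believe it is no longer in use.

@Sohimsso1970, yes, I hadn't noticed until now. Here is an archived page from September 2010:

Looking at the code, this is the strictest and best of the regex answers I've seen here. I can't see any immediate flaws in it, although I would advise against trying to sanitize HTML with regular expressions.

The actual answer should be included in the post. If the link goes dead, this answer is worthless. Stack Overflow 101.

Can you explain why and how this answers the question?

This is the solution that actually met my needs. I needed to strip all HTML except (link) tags… string[] ignorableTags = { "a" }; StripHtml(mytextwithlinks, true, ignorableTags);

This is a very bad solution. Not only does it mess up your HTML, it only actually removes a tag when the tag has a strict closing tag. So simply adding a space after the closing tag lets malicious code through: alert("exploit")

It tries to be a blacklister rather than a whitelister, so anything unknown will happily pass through.

I see what you're saying; just change the regex to

Same as my comment on the accepted answer: not secure, easy to bypass.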
 Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
 Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
      ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
 html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)

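    // Requires: using System.Linq; (for Contains) and using System.Text.RegularExpressions;
    /// <summary>
    /// Strips HTML tags from the text, ignoring the specified tags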
    /// </summary>
    /// <param name="text">the text from which html is to be removed</param>
    /// <param name="isRemoveScript">specify if you want to remove scripts</param>
    /// <param name="ignorableTags">specify the tags that are to be ignored while stripping</param>
    /// <returns>Stripped Text</returns>
    public static string StripHtml(string text, bool isRemoveScript, params string[] ignorableTags)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.Replace("&lt;", "<");
            text = text.Replace("&gt;", ">");
            string ignorePattern = null;

            if (isRemoveScript)
            {
                text = Regex.Replace(text, "<script[^<]*</script>", string.Empty, RegexOptions.IgnoreCase);
            }
            if (!ignorableTags.Contains("style"))
            {
                text = Regex.Replace(text, "<style[^<]*</style>", string.Empty, RegexOptions.IgnoreCase);
            }
            foreach (string tag in ignorableTags)
            {
                //the character b spoils the regex so replace it with strong
                if (tag.Equals("b"))
                {
                    text = text.Replace("<b>", "<strong>");
                    text = text.Replace("</b>", "</strong>");
                    if (ignorableTags.Contains("strong"))
                    {
                        ignorePattern = string.Format("{0}(?!strong)(?!/strong)", ignorePattern);
                    }
                }
                else
                {
                    //Create ignore pattern for the tags to ignore
                    ignorePattern = string.Format("{0}(?!{1})(?!/{1})", ignorePattern, tag);
                }

            }
            //finally add the ignore pattern into regex <[^<]*> which is used to match all html tags
            ignorePattern = string.Format(@"<{0}[^<]*>", ignorePattern);
            text = Regex.Replace(text, ignorePattern, "", RegexOptions.IgnoreCase);
        }

        return text;
    }
