C#: What regular expression is good for extracting URLs from HTML?

c# regex url

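I've tried my own patterns and the top-voted ones here on StackOverflow, but most of them match more than intended.

For example, some would extract

http://foo.com/hello?world

Your regex needs the dash "-" in the last character group to be escaped:
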
@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)"
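
To see why the escape matters, here is a minimal sketch (the class and method names are hypothetical, not part of the answer). Inside a character class, an unescaped dash between two characters defines a range, so `[+-=]` accepts every character from `+` (0x2B) through `=` (0x3D), and that range contains `<`. Escaping the dash reduces the class to the three literal characters.

```csharp
using System.Text.RegularExpressions;

public static class DashRangeDemo
{
    // Unescaped dash: a range from '+' (0x2B) to '=' (0x3D), which includes '<'.
    public static bool UnescapedMatches(string s) => Regex.IsMatch(s, @"^[+-=]$");

    // Escaped dash: only the literal characters '+', '-' and '='.
    public static bool EscapedMatches(string s) => Regex.IsMatch(s, @"^[+\-=]$");
}
```

With these helpers, `DashRangeDemo.UnescapedMatches("<")` is true while `DashRangeDemo.EscapedMatches("<")` is false, which is why the unescaped version keeps consuming HTML after the URL.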

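Essentially, you were allowing every character from + to = (a range that includes <).

The safest regex is to not use a regex at all and to use the System.Uri class instead: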

    Uri uri = new Uri("http://myUrl/%2E%2E/%2E%2E");
    Console.WriteLine(uri.AbsoluteUri);
    Console.WriteLine(uri.PathAndQuery);

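Note that System.Uri parses and validates a string; it does not locate URLs inside HTML. One way to combine it with a looser extraction step is sketched below, assuming the candidate strings have already been pulled out of the markup by some other means. The class name `UriCheck` and the choice to accept only http and https are assumptions made for illustration.

```csharp
using System;

public static class UriCheck
{
    // Returns true only when the candidate parses as an absolute URI
    // whose scheme is http or https (an assumed policy, adjust as needed).
    public static bool IsHttpUrl(string candidate)
    {
        return Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

`Uri.TryCreate` returns false instead of throwing, so it is convenient for filtering a large list of candidates. It is fairly lenient about which characters it accepts, so it complements a pattern that stops at the right place rather than replacing one.
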
Try this:

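    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

    // Returns every match of pattern in input. If groupName is empty, the whole
    // match is returned; otherwise only the value of that named group.
    public static string[] Parse(string pattern, string groupName, string input)
    {
        var list = new List<string>();

        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        for (var match = regex.Match(input); match.Success; match = match.NextMatch())
        {
            list.Add(string.IsNullOrWhiteSpace(groupName) ? match.Value : match.Groups[groupName].Value);
        }

        return list.ToArray();
    }

    // Matches a scheme, "://", a host part, then an optional path and query.
    public static string[] ParseUri(string input)
    {
        const string pattern = @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*";

        return Parse(pattern, string.Empty, input);
    }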
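
The named groups in the pattern above can also be read individually. The sketch below reuses the same pattern with `Regex.Matches`; the class name `ParseUriDemo`, its methods, and the sample HTML are hypothetical and only illustrate the idea.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ParseUriDemo
{
    // Same pattern as in ParseUri.
    private const string Pattern =
        @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*";

    // Returns the full text of every match in the input.
    public static string[] Urls(string input) =>
        Regex.Matches(input, Pattern, RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    // Returns the value of one named group ("Protocol" or "Domain") per match.
    public static string[] Group(string groupName, string input) =>
        Regex.Matches(input, Pattern, RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Groups[groupName].Value)
             .ToArray();
}
```

For the input `<a href="http://foo.com/hello?world">link</a>`, `Urls` yields `http://foo.com/hello?world` and `Group("Domain", ...)` yields `foo.com`. The match ends at the closing quote because neither `"` nor `<` is in the final character class, which is the behavior the question asks for.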

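What is invalid about those characters?

HTML obviously should not be collected along with the URL.

Why use a regex at all? http://foo.com/bar?flubHTML should not be collected?

I think pst is right, and I think we have a problem on our hands. What exactly are you trying to solve?

@Drake I don't know of a way to reliably extract links from HTML with a regex (in all contexts) without restricting what an "acceptable URL" is. If the range of acceptable URLs is reduced (e.g., "cannot contain

This collects thousands of blank lines and single words. I need a pattern that matches URLs rather than validating them.

mailto? ... custom handlers ...?

@RitchMelton Instead of (https?|ftp|gopher|telnet|file|notes|ms-help), use (\w+)

@Jason - That would stop matching anything. My point is that this type of regex is not a good idea.

How does it stop matching anything? \w+ is equivalent to [a-z0-9-+-- this would cover (https?|ftp|gopher|telnet|file|notes|ms-help) as well as any custom handlers you suggested.

In this case, regex is actually quite workable. I tested it, and it works very well.
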
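
One detail of the `(\w+)` suggestion can be checked directly. In .NET, `\w` matches letters, digits, and the underscore (plus other Unicode word characters) but not the dash, so a scheme such as ms-help is only partly captured. The sketch below is an illustration under that observation; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class SchemeDemo
{
    // True when the whole string consists of \w characters.
    public static bool IsWordOnly(string scheme) => Regex.IsMatch(scheme, @"^\w+$");

    // Returns what (\w+) captures as the scheme in front of "://".
    public static string CapturedScheme(string url) =>
        Regex.Match(url, @"(\w+)://").Groups[1].Value;
}
```

`SchemeDemo.IsWordOnly("https")` is true and `SchemeDemo.IsWordOnly("ms-help")` is false. For `ms-help://foo`, `CapturedScheme` returns `help`, so the URL is still found but its scheme is truncated at the dash.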