C#: How do I extract URLs from a string when they are not between specific characters?

c# regex

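I want to extract all URLs from a string, except those that sit between two specific characters. A URL should not be extracted if it is located between the following characters:

and >

One way to do this is to use an HTML parser such as HtmlAgilityPack, which the sample below relies on.

Below is a sample that extracts every URL from the inner text of all elements in an HTML fragment. You can modify it to extract only what you need: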

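using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var content = "<strong>http://www.helloworld.com/test</strong> with a hyperlink <a href=\"www.google.com\">www.google.com</a> and also a normal link www.youtube.com dsdsd sometext http://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf";
var document = new HtmlDocument();
document.LoadHtml(content);
// Loose URL pattern: optional http(s) scheme, dotted host name, then path/query characters
var regex = new Regex(@"(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+", RegexOptions.Compiled);
FindMatchesInText(document.DocumentNode, regex);

// Walks the node tree and prints every URL found in text nodes.
// Attribute values such as href are not part of any text node, so they are never scanned.
static void FindMatchesInText(HtmlNode parentNode, Regex regex)
{
    foreach (var node in parentNode.ChildNodes)
    {
        // Match text nodes only: an element's InnerText repeats the text of its
        // descendants, so matching every node would print the same URL more than once.
        if (node.NodeType == HtmlNodeType.Text)
        {
            var match = regex.Match(node.InnerText);
            while (match.Success)
            {
                Console.WriteLine(match.Value);
                match = match.NextMatch();
            }
        }
        // Recurse into child nodes
        FindMatchesInText(node, regex);
    }
}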

Output:

www.google.com

www.youtube.com
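
As one possible modification of the sample, assuming link text should be ignored as well as attribute values, the walk can skip `a` elements and scan only the remaining text nodes. This is a sketch rather than the answerer's code: `CollectUrls` and the variable names are made up for illustration, and HtmlAgilityPack is assumed to be installed from NuGet.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, the same library used in the sample above

// Sample input and URL pattern reused from the answer.
var html = "<strong>http://www.helloworld.com/test</strong> with a hyperlink <a href=\"www.google.com\">www.google.com</a> and also a normal link www.youtube.com dsdsd sometext http://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf";
var urlRegex = new Regex(@"(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+");

var document = new HtmlDocument();
document.LoadHtml(html);

var found = new List<string>();
CollectUrls(document.DocumentNode, urlRegex, found);
foreach (var url in found)
    Console.WriteLine(url);

// Hypothetical helper: gathers URLs from text nodes that are not inside an <a> element.
static void CollectUrls(HtmlNode parent, Regex regex, List<string> results)
{
    foreach (var node in parent.ChildNodes)
    {
        // Skip <a> elements entirely, so neither href values nor link text are scanned.
        if (node.NodeType == HtmlNodeType.Element && node.Name == "a")
            continue;

        if (node.NodeType == HtmlNodeType.Text)
        {
            foreach (Match match in regex.Matches(node.InnerText))
                results.Add(match.Value);
        }

        // Descend into the remaining elements.
        CollectUrls(node, regex, results);
    }
}
```

Collecting into a list instead of printing inside the walk keeps the traversal reusable; whether link text should count as a URL depends on the exact exclusion rule, which the question leaves partly unstated.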

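Rule 1: never use regular expressions to parse HTML. Rule 2: if you need to parse HTML with regular expressions, see rule 1.

@LuisHenrique Yes, all URLs.

I would suggest parsing it with HtmlAgilityPack, omitting all a tags and their content, and then searching within the plain text nodes.

@LuisHenrique No. Check my comment.

The code would have been posted and explained, but the question was marked as a duplicate, which it is not. So I can only give you this, which cannot be shared with the community for now, sorry.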

(http://|https://|ftp://|mailto:|www\.){1}(?![^>]*<)(?![^"]*")[^^\\\"\n\s\}\{\|\`<>~]*
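
Read literally, the pattern above takes a scheme or `www.` prefix and then applies two negative lookaheads: `(?![^>]*<)` rejects the candidate if a `<` appears ahead before the next `>`, and `(?![^"]*")` rejects it if any double quote remains later in the input. A minimal sketch of applying it with .NET's `Regex` class follows; the sample string is reused from the answer, and the variable names are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input reused from the answer above.
var input = "<strong>http://www.helloworld.com/test</strong> with a hyperlink <a href=\"www.google.com\">www.google.com</a> and also a normal link www.youtube.com dsdsd sometext http://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf";

// The pattern from the comment, written as a C# verbatim string
// (inside @"..." every double quote has to be doubled).
var pattern = new Regex(@"(http://|https://|ftp://|mailto:|www\.){1}(?![^>]*<)(?![^""]*"")[^^\\\""\n\s\}\{\|\`<>~]*");

// Collect the matched URLs in order of appearance.
var urls = pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

foreach (var url in urls)
    Console.WriteLine(url);
```

Because both lookaheads inspect everything that follows the prefix, a URL in plain text is also skipped whenever a tag or a quoted attribute appears later in the string. The result therefore depends on where the URL sits in the input, which makes this a heuristic rather than a substitute for parsing the HTML.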
