C# Regex: values from multiple "a href" tags


c# html regex


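I want to be able to scrape HTML that contains multiple "a href" tags.

So I want these values (href | link text):

| Classic link
| I love HTML 5
/my-tribute-to-javascript.html | I love JS too

As you can see, only the value of the href attribute and the content inside the a tag should be captured. It should support every href that is valid in HTML 5, and the href attribute may be surrounded by any other attributes.
So I basically need a regular expression to fill in the following code: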

public IEnumerable<Tuple<string, string>> GetLinks(string html)
{
    string pattern = string.Empty; // TODO: Get solution from Stackoverflow
    var matches = Regex.Matches(html, pattern);
    foreach (Match match in matches)
    {
        yield return new Tuple<string, string>(
            match.Groups[0].Value,
            match.Groups[1].Value);
    }
}

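Wouldn't it be easier to use the HTML Agility Pack with XPath than a regular expression?
Something like:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression and returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlNode node in aNodeCollection)
{
    var href = node.Attributes["href"].Value; // the link target
    var text = node.InnerText;                // the link content
}

It's only a sketch, not tested code.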
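I keep reading that parsing HTML with regular expressions is evil. Well... that is certainly true...
But like evil itself, regular expressions are also fun :)
So I would try this: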

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");
foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value,
        match.Groups["value"].Value);
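For anyone who wants to try this approach end to end, here is a minimal self-contained sketch. It is a variation on the pattern above, not the exact regex from this answer: the class name `LinkExtractor`, the `\b` after `<a`, the backreference on the quote character, and the two `RegexOptions` flags are my own assumptions. It still only handles quoted href values, and it will break on attribute values that contain `>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original question.
public static class LinkExtractor
{
    // Assumed tightening of the pattern from the answer:
    //   <a\b        matches <a ...> but not <abbr> or <article>
    //   \shref      requires whitespace before href, so data-href is skipped
    //   (?<q>["'])  remembers which quote opened the value
    //   \k<q>       requires the same quote to close it
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?<q>[""'])(?<href>.*?)\k<q>[^>]*>(?<value>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Item1 is the href value, Item2 is the markup between <a ...> and </a>.
            yield return Tuple.Create(
                match.Groups["href"].Value,
                match.Groups["value"].Value);
        }
    }
}
```

Calling `LinkExtractor.GetLinks` on `<a class="nav" href='https://example.com/'>Classic link</a> <a href="/my-tribute-to-javascript.html">I love JS too</a>` (the first URL is made up for illustration) yields two tuples, one per link. Unquoted values such as `<a href=/page>` are valid HTML 5 but are not matched by this sketch, which is one more reason to prefer a real parser for anything beyond a quick experiment.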

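"TODO: Get solution from Stackoverflow" - really? How about "TODO: Try to work out a solution, and check Stackoverflow if I get stuck"?

@nnnnnn Got it, no jokes allowed... Very constructive comment.

I apologise, of course jokes are allowed. In my sleep-deprived state I didn't realise it was a joke, otherwise I wouldn't have commented like that. (I sometimes post "What have you tried so far?" comments, but in fairness your question gives plenty of detail about your requirements plus some code, so it doesn't fit the profile of the usual "do my work for me" question.)

Interesting approach, but the question specifically says HTML 5, which is not necessarily valid XML.

I still haven't had time to dig into HTML 5, so I didn't know it allows malformed documents (which seems like a step backwards), but I would still give it a try. The Agility Pack has worked well for me: even with nasty HTML it cleans things up nicely.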