PHP: To find all links in HTML, which method is correct? Regular expressions or parsing the DOM?

php html regex dom

I want to get all the href links in an HTML document. I came across two possible approaches. One is a regular expression:

$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[2];          // link address
        echo $match[3] . "<br>"; // link text
    }
}

The other is to create a DOM document and parse it:

$html = urldecode(base64_decode($html_file));

// Create a new DOM document
$dom = new DOMDocument;

// Parse the HTML. The @ is used to suppress any parsing errors
// that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

// Get all links. You could also use any other tag name here,
// like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

// Iterate over the extracted links and display their text and URLs
foreach ($links as $link) {
    // Show the link text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

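I don't know which of these is more efficient. But the code will be used many times, so I'd like to clarify which one is better. Thank you all!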
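Since the code is meant to be reused, the DOM approach can be wrapped in a function. The following is only a minimal sketch: the name `extractLinks` and the returned array shape are assumptions made for illustration, not part of the original post. It uses `libxml_use_internal_errors()` rather than `@` to keep parser warnings quiet, and it skips `<a>` elements that have no `href`.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns one
 * ['href' => ..., 'text' => ...] entry per <a> element that has an href.
 * Assumes $html is a non-empty string; PHP 8 rejects an empty string
 * passed to DOMDocument::loadHTML().
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();

    // Buffer libxml parse warnings internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // Named anchors such as <a name="top"> carry no href; skip them.
        if ($a->hasAttribute('href')) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }
    return $links;
}
```

With the input from the question, a call would look like `extractLinks(urldecode(base64_decode($html_file)))`.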

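What about jsoup?

Surely benchmarking is the best option?

Without a doubt, you should not parse HTML with regular expressions! Use DOM or SimpleXML for relatively small documents, and a SAX/pull parser such as XML Parser or XMLReader for larger ones.

I'm voting to close this question as off-topic because it should be asked elsewhere.

How does your regex handle a web page about HTML that describes the href attribute with examples? It would contain lots of "href=" content that isn't a link....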
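The last comment's point can be checked with a small experiment. The sketch below is an illustration under stated assumptions: the sample HTML, the helper names `hrefsViaRegex` and `hrefsViaDom`, and the regex (a deliberately simplified pattern, not the one from the question) are all made up for the demonstration.

```php
<?php
// Sample page: "href=" appears in visible text and in a comment,
// but only the last line is a real link.
$html = <<<'HTML'
<p>Write <code>href="page.html"</code> inside the opening tag.</p>
<!-- <a href="old.html">old link</a> -->
<a href="https://example.com/">Example</a>
HTML;

// Simplified regex approach (assumption: double-quoted values only).
function hrefsViaRegex(string $html): array
{
    preg_match_all('/href\s*=\s*"([^"]*)"/i', $html, $m);
    return $m[1];
}

// DOM approach: only real <a> elements are visited.
function hrefsViaDom(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// The regex also picks up the text and the commented-out markup.
echo implode(', ', hrefsViaRegex($html)), "\n"; // page.html, old.html, https://example.com/
// The DOM parser reports only the actual link.
echo implode(', ', hrefsViaDom($html)), "\n";   // https://example.com/
```

A more careful regex could rule out some of these cases, but each fix moves the pattern closer to reimplementing an HTML parser.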