PHP regular expression to match a specific URL pattern

I want to "scrape" a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>

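It is better to use an HTML parser, though. Here is an example:

// file_get_html() is provided by the PHP Simple HTML DOM Parser library
$html = file_get_html('http://www.google.com/');
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';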

Here is how to do it properly with the native DOM extension:

// GET file
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run XPath to fetch all href attributes from a elements
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// collect href attribute values from all DomAttr in array
$urls = array();
foreach($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Note that the above will also find relative links. If you do not want those, adjust the XPath to

'//a/@href[starts-with(., "http")]'
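To see what the filter changes, the following is a minimal sketch that runs a few XPath expressions against an inline HTML string instead of a remote page. The helper name `queryValues`, the sample markup, and the use of `libxml_use_internal_errors()` to silence parser warnings are assumptions made for illustration; they are not part of the answer above. The last expression narrows the search to links inside `<h2>`, which is the shape shown in the question.

```php
<?php
// Hypothetical helper (not from the original answer): run an XPath expression
// against an HTML string and return the string values of the matched nodes.
function queryValues(string $html, string $expression): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query($expression) as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Made-up sample markup mixing absolute and relative links.
$sample = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
        . '<p><a href="/about.html">About</a> <a href="https://example.com/">Example</a></p>';

print_r(queryValues($sample, '//a/@href'));                            // all three links
print_r(queryValues($sample, '//a/@href[starts-with(., "http")]'));    // the two absolute links
print_r(queryValues($sample, '//h2/a/@href[starts-with(., "http")]')); // only the link inside <h2>
```

Keep in mind that `starts-with(., "http")` is a plain, case-sensitive string test: it would also accept a relative path that happens to begin with `http`, and it would skip a URL written as `HTTP://...`.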

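Note that using regular expressions to match HTML is the road to madness. A regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over regex for anything beyond matching a super-trivial string pattern in markup.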
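A small sketch of that difference, under stated assumptions: the two sample links and the naive pattern below are invented for illustration and are not taken from the question's real pages. Both links mean the same thing to a browser; they differ only in attribute order and quoting.

```php
<?php
// Two equivalent links: the second swaps the attribute order and uses single quotes.
$html = '<h2><a href="http://example.com/one" target="_blank">One</a></h2>'
      . "<h2><a target='_blank' href='http://example.com/two'>Two</a></h2>";

// A naive pattern written for the exact markup shown in the question (illustrative only).
preg_match_all('~<h2><a href="([^"]+)" target="_blank">~', $html, $matches);
$fromRegex = $matches[1];

// The DOM approach: parse the markup, then ask for the href of every <a> inside an <h2>.
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // suppress warnings about sloppy markup
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$fromDom = [];
foreach ($xpath->query('//h2/a/@href') as $attr) {
    $fromDom[] = $attr->value;
}

print_r($fromRegex); // only http://example.com/one
print_r($fromDom);   // both URLs
```

The pattern could be extended to cover these two variations, but every new variation in the markup needs another tweak, whereas the XPath expression stays the same.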

They... just... never... stop. Tony the Pony... he comes....
