How can I extract a specific type of link from a website using PHP?

php regex

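I am trying to extract a specific type of link from a web page using PHP.

The links have the following format, and I want to extract every link that matches it:

maindomain.com/pages/SomeNumber/SomeText

So far I can extract all the links from the page, but the filter described above is not being applied. How can I achieve this?

Any suggestions?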

This is only a rough guess, but even if I guessed wrong, you can still see the approach:

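foreach ($links as $link) {
    // Extract and show the "href" attribute of every link that matches.
    // Note: this pattern only matches hrefs that contain "http" before the domain.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}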
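
If the hrefs can appear with or without a scheme, as in the question's `maindomain.com/pages/SomeNumber/SomeText`, one option is to make the scheme optional and anchor the pattern at both ends. The following is a minimal sketch under that assumption, not a definitive filter; the helper name `isPageLink` and the sample hrefs are made up for illustration.

```php
<?php
// Hypothetical helper: true when $href looks like
// [http(s)://]maindomain.com/pages/<digits>/<text>.
function isPageLink(string $href): bool
{
    // "~" is the delimiter, so the slashes in the URL need no escaping.
    $pattern = '~^(?:https?://)?maindomain\.com/pages/\d+/[^/]+$~';
    return preg_match($pattern, $href) === 1;
}

// Sample hrefs, invented for this sketch.
$hrefs = [
    'http://maindomain.com/pages/123/some-text',
    'maindomain.com/pages/45/other',
    'http://maindomain.com/about',
    'http://maindomain.com/pages/abc/some-text',
];

// Keep only the hrefs that pass the check, reindexed from 0.
$matches = array_values(array_filter($hrefs, 'isPageLink'));
print_r($matches);
```

Inside the original loop, the same helper could be called as `isPageLink($link->getAttribute('href'))` in place of the inline `preg_match`.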

You can use DOMXPath and register your own function with it, so that the function can be used inside an XPath query:

function checkURL($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);

    if ( count($parts) == 2    &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTMLFile($filename);

$xp = new DOMXPath($dom);

$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');

$links = $xp->query("//a[php:functionString('checkURL', @href)]");

foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}

This way, you extract only the links you want.
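
To try the approach without a real page, the sketch below runs the same idea against a small inline HTML string. The markup and the `$found` array are assumptions made for illustration, and `checkURL` is repeated in slightly condensed form so the snippet runs on its own.

```php
<?php
// Same check as above, condensed; repeated so this sketch is self-contained.
function checkURL($url) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false;
    }
    unset($parts['scheme']);
    return count($parts) == 2
        && isset($parts['host'], $parts['path'])
        && preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) === 1;
}

// Made-up sample markup standing in for the real page.
$html = <<<'HTML'
<html><body>
<a href="http://maindomain.com/pages/12/foo">match</a>
<a href="http://maindomain.com/pages/12/foo?x=1">has a query string</a>
<a href="http://maindomain.com/other/12/foo">wrong path</a>
<a href="/pages/12/foo">no host</a>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);

$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('checkURL');

// Collect the hrefs of the links that pass checkURL().
$found = [];
foreach ($xp->query("//a[php:functionString('checkURL', @href)]") as $link) {
    $found[] = $link->getAttribute('href');
}
print_r($found);
```

Only the first link passes: the second has an extra `query` part, the third fails the path pattern, and the fourth has no host. Loosening any of those rules means changing `checkURL`, not the XPath query.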

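You are already using a parser, so you can go one step further and run XPath queries on the DOM. XPath also provides functions such as starts-with(), so this might work: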

$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");

Then loop over them:

foreach ($links as $link) {
    // do sth. with it here
    // after all, it is a DOMElement
}

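Although regex is no great friend of HTML, I think it could work if the links sit in one "region" of the page. You could cut that section out with strpos() and then use strip_tags() to remove the tags that might trip up the regex. Do you have any sample data?

Wait... you only need the regex? Something like (?:http.*)maindomain\.com\/pages\/\d+\/.

Thanks for your reply. How can I use the above regex in the code I provided?

Damn, I'm late, I was still fiddling with my answer :) I like your solution (+1), but why not use starts-with() in this case?

@Jan: starts-with() cannot work both with and without the scheme part of the URL. I think the goal is not specifically to test what the host is, but to test what the path looks like. Note that the function makes it easy to add further checks if needed.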
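
To illustrate the point about the scheme, here is a small sketch of how `parse_url()` splits a few hrefs once the scheme is dropped. The sample hrefs and the `$parsed` array are assumptions made for illustration only.

```php
<?php
// Made-up hrefs: with a scheme, protocol-relative, and with no scheme at all.
$samples = [
    'http://maindomain.com/pages/12/foo',
    '//maindomain.com/pages/12/foo',
    'maindomain.com/pages/12/foo',
];

$parsed = [];
foreach ($samples as $href) {
    $parts = parse_url($href);
    // Drop the scheme, as checkURL() does, and keep only the component names.
    unset($parts['scheme']);
    $parsed[$href] = array_keys($parts);
}
print_r($parsed);
```

The first two forms yield separate `host` and `path` components, so a path check applies to both, while `starts-with(@href, 'maindomain.com')` would match neither. In the last form `parse_url()` puts everything into `path`, so that case would need a rule of its own.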
