PHP: How do I extract all URL links from an RSS feed?

php mysql regex xml rss


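I need to periodically extract all the links to news articles from an RSS feed into a MySQL database. How do I go about this? Can I use a regular expression (in PHP) to match the links, or is there a better alternative? Thanks in advance.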

Update 2: I tested the code below and had to change

    $links = $dom->getElementsByTagName('a');

to:

    $links = $dom->getElementsByTagName('link');

After that it output the links successfully. Good luck.

Update: there seems to be a complete answer here:

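I developed a solution that can recurse through all the links on my website. I have removed the code that verifies that the domain stays the same on each recursion (the question did not ask for it), but you can easily add a check if you need one.

Using PHP's DOMDocument you can parse an HTML or XML document and read the links from it. This is better than using regular expressions. Try something like this: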

<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300); 

// One array collects the links. It is passed by reference to linksearch(),
// which keeps things simple if the search becomes recursive.
$alinks = array();

// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";

// do the search
linksearch($link, $alinks);

// show results
var_dump($alinks);

function linksearch($url, &$alinks) {
    echo "<br>Searching: $url";

    // Load the page; give up on this URL if it cannot be fetched
    $html = @file_get_contents($url);
    if ($html === false || $html === '') {
        return;
    }

    // Create a new DOM document
    $dom = new DOMDocument;

    // Parse the markup. The @ suppresses the warnings raised
    // when the $html string isn't valid HTML.
    @$dom->loadHTML($html);

    // Get all links. You could also use any other tag name here,
    // like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('link');

    // Collect the non-empty "href" attribute of each extracted link
    $href = array();
    foreach ($links as $link) {
        $value = $link->getAttribute('href');
        if ($value !== '') {
            $href[] = $value;
        }
    }

    // Remove duplicates, then keep only the links not processed yet
    $queue = array_diff(array_unique($href), $alinks);

    // Update the array passed by reference with the new links found
    $alinks = array_merge($alinks, $queue);

    foreach ($queue as $link) {
        // Recursive search: uncomment the call below to follow each new link.
        // Remember to check that the domain is the same as the starting one.
        // linksearch($link, $alinks);
    }
}
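The domain check mentioned above was left out of the answer. A minimal sketch of one way to write it, assuming that comparing host names with `parse_url()` is strict enough for your case; the helper name `isSameDomain` is made up for illustration:

```php
<?php
// Hypothetical helper: true when $candidate has the same host as $start.
function isSameDomain(string $start, string $candidate): bool
{
    $startHost = parse_url($start, PHP_URL_HOST);
    $candidateHost = parse_url($candidate, PHP_URL_HOST);

    // parse_url() returns null when the URL has no host (e.g. a relative
    // URL) and false for seriously malformed URLs; treat both as no match.
    if (!is_string($startHost) || !is_string($candidateHost)) {
        return false;
    }

    // Host names are case-insensitive.
    return strcasecmp($startHost, $candidateHost) === 0;
}

$sameHost  = isSameDomain('http://example.com/feed.xml', 'http://EXAMPLE.com/story');
$otherHost = isSameDomain('http://example.com/feed.xml', 'http://other.example.org/');
$relative  = isSameDomain('http://example.com/feed.xml', '/relative/path');
var_dump($sameHost, $otherHost, $relative);
```

Inside the loop of `linksearch()`, the recursive call could then be guarded with `if (isSameDomain($url, $link))`. Note that this sketch treats subdomains as different hosts.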

DOM plus XPath lets you fetch nodes using expressions.

RSS item links

To get the RSS `link` elements (the link of each item):
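A minimal sketch for this step, assuming the same DOMDocument and DOMXPath approach used in the other examples. The helper name `rssItemLinks` and the inline sample feed are made up for illustration; in practice `$xml` would come from `file_get_contents($url)`.

```php
<?php
// Hypothetical helper: returns the text of every <link> under channel/item.
function rssItemLinks(string $xml): array
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    $links = [];
    // RSS 2.0 elements are not namespaced, so no prefix is needed here.
    foreach ($xpath->evaluate('//channel/item/link') as $link) {
        // An RSS <link> stores its URL as text content, not in an attribute.
        $links[] = trim($link->textContent);
    }
    return $links;
}

// Made-up sample feed, standing in for file_get_contents($url).
$sampleFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item><title>First</title><link>http://example.com/a</link></item>
    <item><title>Second</title><link>http://example.com/b</link></item>
  </channel>
</rss>
XML;

$itemLinks = rssItemLinks($sampleFeed);
print_r($itemLinks);
```

The channel-level `link` is skipped because the expression only matches `link` elements that sit directly inside an `item`.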

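Atom links

The `atom:link` elements have different semantics. They are part of the Atom namespace and are used to describe relations. The NYT uses the `standout` relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly: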

$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

$expression = '//channel/item/a:link[@rel="standout"]/@href';

foreach ($xpath->evaluate($expression) as $link) {
  var_dump($link->value);
}

There are `prev` and `next` relations here as well.

HTML links (`a` elements)

The `description` elements contain HTML fragments. To extract links from them, you have to load the HTML into a separate DOM document:

$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

$expression = '//channel/item/description';

foreach ($xpath->evaluate($expression) as $description) {
  $fragment = new DOMDocument();
  $fragment->loadHtml($description->textContent);
  $fragmentXpath = new DOMXpath($fragment);
  foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
    var_dump($link->value);
  } 
}
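HTML fragments are rarely well-formed, and some items have an empty description, so the loop above can produce parser warnings or errors on a real feed. A slightly more defensive sketch of the same technique, assuming duplicates should be dropped; the helper name `descriptionLinks` and the sample feed are made up for illustration:

```php
<?php
// Hypothetical helper: collects the href values of all <a> elements found
// in the HTML fragments of every channel/item/description.
function descriptionLinks(string $xml): array
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    $links = [];
    foreach ($xpath->evaluate('//channel/item/description') as $description) {
        $html = trim($description->textContent);
        if ($html === '') {
            continue; // loadHTML() rejects an empty string
        }
        // Each fragment gets its own DOM document and XPath instance.
        $fragment = new DOMDocument();
        $fragment->loadHTML($html);
        $fragmentXpath = new DOMXPath($fragment);
        foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $href) {
            $links[] = $href->value;
        }
    }

    // Restore the previous libxml error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array_values(array_unique($links));
}

// Made-up sample feed, standing in for file_get_contents($url).
$sampleFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <description><![CDATA[Read <a href="http://example.com/a">more</a> or <a href="http://example.com/b">this</a>.]]></description>
    </item>
    <item>
      <description></description>
    </item>
    <item>
      <description><![CDATA[Again <a href="http://example.com/a">more</a>.]]></description>
    </item>
  </channel>
</rss>
XML;

$found = descriptionLinks($sampleFeed);
print_r($found);
```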

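Sure. RSS is just an XML document, so you can parse it easily. Do you have a specific problem with it?

I'm new to web development, so I don't know how to approach this.

OK, then I suggest you learn a bit about PHP and RSS/XML first! :)

I have tested my answer and updated my code so that it finishes reading the links on the site.

@chris85 Although this may be a duplicate, I found that different tag names are used to store the links (`link` vs. `a`). I didn't find that in the post, although I'm sure it is covered somewhere in the mass of linked information.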
