Basic web-crawling question: how do I create a list of all the pages on a website using PHP?


I want to create a crawler in PHP that will give me a list of all the pages on a specific domain (starting from the home page: www.example.com).

How can I do this in PHP?

I don't know how to recursively find all the pages on a website, starting from a specific page and excluding external links.

For a general approach, take a look at the answers to this other question:

In PHP, you should be able to fetch the remote URL quite simply (for example, with file_get_contents()). You can do a simple parse of the HTML using regular expressions (typical approaches for this are easy to find).
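
As an illustration only, a minimal sketch of that regex-based approach might look like this (the pattern is a deliberately naive assumption and will miss plenty of real-world markup):

// Fetch the page (allow_url_fopen must be enabled for remote URLs).
$content = file_get_contents('http://www.example.com/');

// Naively pull the href attribute out of every anchor tag.
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $content, $matches);

foreach ($matches[1] as $href) {
    echo $href, "\n";
}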

Once you have extracted the raw href attribute, you can break it into its components (parse_url() is handy here) and work out whether it is a URL you actually want to fetch. Keep in mind that URLs may be relative to the page you fetched them from.
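
As a rough sketch of that step, assuming parse_url() does the splitting; the shouldFetch() helper is just an illustrative name, and the relative-URL handling below is deliberately simplified (it ignores ../ segments, query-only links and so on):

// Decide whether an extracted href should be crawled.
// $base is the URL of the page the href was found on.
function shouldFetch($href, $base)
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }

    // No host component usually means a relative URL: resolve it
    // against the page it came from (very simplified handling).
    if (empty($parts['host'])) {
        $href  = rtrim($base, '/') . '/' . ltrim($href, '/');
        $parts = parse_url($href);
    }

    // Only follow links that stay on the same host as the base page.
    return isset($parts['host'])
        && strcasecmp($parts['host'], parse_url($base, PHP_URL_HOST)) === 0;
}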

Although quick, regular expressions are not the best way of parsing HTML; you could instead run the HTML you fetch through a DOM parser, for example:

$dom = new DOMDocument();
$dom->loadHTML($content);

// Walk every anchor element in the fetched document.
$anchors = $dom->getElementsByTagName('a');

if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');

            //now figure out whether to process this
            //URL and add it to a list of URLs to be fetched
        }
    }
}
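
To cover the recursive part of the question, those pieces can be tied together with a queue of URLs to visit and a set of URLs already seen, rather than literal recursion. Here is a rough, unoptimized sketch; extractLinks() is a placeholder for a function that wraps the DOM code above and returns absolute, same-domain URLs:

// Breadth-first crawl of a single site, starting from $start.
// extractLinks($html, $base) is assumed to wrap the DOMDocument code
// above and return absolute URLs on the same domain as $base.
function crawlSite($start)
{
    $queue = array($start);
    $seen  = array($start => true);
    $pages = array();

    while ($queue) {
        $url  = array_shift($queue);
        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip pages that fail to load
        }
        $pages[] = $url;

        foreach (extractLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }

    return $pages; // every page reachable from the start URL
}
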
Finally, rather than writing it all yourself, it is also worth looking at this question for other, ready-made resources you could use.



Overview

Here are a few notes on the basics of the crawler:

It is a console app - It doesn't need a rich interface, so I figured a console application would do. The output is done as an html file and the input (what site to view) is done through the app.config. Making a windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-html files and external urls, just in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an Html Page

The first key piece of building the crawler is the mechanism for pulling the html down from the web (or from your local machine, if you have the site running locally). Like so much else, .NET has classes built into the framework for doing exactly this.

    private static string GetWebText(string url)
    {
        // Request the page; the custom user agent identifies the crawler.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";

        // Read the whole response body into a string, disposing the
        // response, stream and reader when done.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            string htmlText = reader.ReadToEnd();
            return htmlText;
        }
    }
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved via a call to GetResponse()) holds the data you are after: grab the response stream, put it in a StreamReader, and read out the text to get the html. For reference:


Are you doing this just for fun? Because there are plenty of free, pre-built web crawlers out there.
The question is about PHP. How does a .NET answer help?