Crawling an entire website with Simple HTML DOM

parsing web-crawler

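I want to crawl an entire website. I'm using Simple_html_dom for parsing, but the problem is that it takes only one page URL at a time. I'd like to supply just the starting (home page) link and have it crawl and parse every page of the site automatically. Any suggestions?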

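While parsing a single page's DOM, store every link that points to the same domain in an array. When parsing finishes, check whether the array is empty. If it isn't, take the first link and repeat the same process on it.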

For example (the code sample is written in Python-like syntax, but you can easily adapt it to PHP; my PHP is rusty):

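from collections import deque

referenced_links = deque(['your_initial_page.html'])  # links waiting to be crawled
visited = set()  # pages already crawled, so circular links don't loop forever

def crawl_dom(url):
    # download the url, parse the DOM and append all same-domain
    # hyperlinks to referenced_links
    pass

while referenced_links:  # while the queue isn't empty...
    url = referenced_links.popleft()  # take and remove the first item
    if url not in visited:
        visited.add(url)
        crawl_dom(url)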
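The body of `crawl_dom` is left as a comment above. One way it might be filled in is sketched below, using only Python's standard library. This is a minimal sketch and not the answerer's own code: the names `LinkCollector`, `extract_same_domain_links`, `crawl`, `fetch` and `max_pages` are all hypothetical, and "same domain" is assumed to mean "same host and port as the page being parsed". Downloading is passed in as a `fetch` callable so the crawl logic can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_same_domain_links(base_url, html):
    """Return absolute, fragment-free http(s) links that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the current page and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; fetch(url) must return HTML text.

    Returns the visited URLs in the order they were crawled.
    max_pages is an assumed safety limit, not part of the original answer.
    """
    queue = deque([start_url])
    seen = {start_url}  # every URL ever queued, so no page is crawled twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_same_domain_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" standing in for real HTTP responses.
    site = {
        "http://example.com/": '<a href="/a">A</a> <a href="http://other.org/x">X</a>',
        "http://example.com/a": '<a href="/">home</a> <a href="b#top">B</a>',
        "http://example.com/b": '<a href="mailto:x@y.z">mail</a>',
    }
    for page in crawl("http://example.com/", lambda u: site.get(u, "")):
        print(page)
```

For a real site, `fetch` could wrap something like `urllib.request.urlopen(url).read().decode(...)`; error handling, politeness delays and `robots.txt` checks are deliberately left out of this sketch. The same structure carries over to PHP, with Simple HTML DOM doing the job that `LinkCollector` does here.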