Python 3.x: How do I implement breadth-first and depth-first search web crawlers?

python-3.x web-crawler

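I'm trying to write a web crawler in Python with BeautifulSoup that crawls a webpage for all of its links. After I obtain all the links on the main page, I want to implement depth-first and breadth-first search to find 100 additional links. So far, I have scraped the main page and collected its links. Now I need help implementing the depth-first and breadth-first parts of my crawler.

I believe my web crawler is doing a depth-first search. Is that correct, or is my code not performing a depth-first search properly? Also, how can I adjust my code to create a breadth-first search? I believe I need a queue and the pop function, but I'm not sure how to write the loops correctly, as I'm new to Python.

I have tried adjusting my code, but nothing I've tried so far has produced the correct result.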

from pandas import *
import urllib.request
import re
import time
from bs4 import BeautifulSoup

#open webpage and put into soup

myURL="http://toscrape.com"
response = urllib.request.urlopen(myURL)
html = response.read()
soup = BeautifulSoup(html, "html.parser")

#get links on the main page 

websitesvisited = []
for link in soup.findAll('a'):
    websitesvisited.append(link.get('href'))

#use depth-first search to find 100 additional links

allLinks= [] 
for links in websitesvisited:
    myURL=links
    response = urllib.request.urlopen(myURL)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    if len(allLinks) < 101:
        for link in soup.findAll('a'):
            if link.get('href') not in allLinks:
                if link.get('href') != None:
                    if link.get('href') [0:4] == 'http':
                        allLinks.append(link.get('href'))
    time.sleep(3)

for weblinks in allLinks:
    print(weblinks)

I have crawled the main page and collected all of its links. Now I expect to get roughly 100 additional links by crawling depth-first and breadth-first.

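You're very much on the right track. The key to a DFS is recursion, which is the missing element in the code above. For each link on the current page, recursively explore it before visiting the remaining links on the page. Use a visited set to keep track of which pages have already been crawled, so you don't get caught in cycles.

The "total links explored" value is likely not helpful in a DFS, because your crawler will just shoot down the first link of each of the first 100 pages, then head back up without any breadth (nearly every page on the internet has links, so terminal nodes are hard to come by). A "depth" (or distance) cap makes more sense: it lets us explore all links up to max_depth pages away from the current one.

Either way, the code is basically the same, and of course you can say "give me the first cap links up to max_depth pages deep" if you code that in as the base case of the recursion. Another idea is to ensure that all of the links you're exploring are from the quotes.toscrape site. A BFS, by contrast, is strict about exploring the immediate frontier first and then fanning out. This can be done iteratively with a queue.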

Here is a recursive DFS sketch:

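import requests
from bs4 import BeautifulSoup

def get_links_recursive(base, path, visited, max_depth=3, depth=0):
    # Base case: stop descending once the depth cap is reached
    if depth < max_depth:
        try:
            # Naive concatenation of base and path; relative links are not resolved properly
            soup = BeautifulSoup(requests.get(base + path, timeout=10).text, "html.parser")
        except requests.RequestException:
            return  # skip pages that fail to load

        for link in soup.find_all("a"):
            href = link.get("href")

            # Skip anchors without an href and links that were already seen
            if href and href not in visited:
                visited.add(href)
                print(f"at depth {depth}: {href}")

                # Recurse right away, before looking at the sibling links (DFS)
                if href.startswith("http"):
                    get_links_recursive(href, "", visited, max_depth, depth + 1)
                else:
                    get_links_recursive(base, href, visited, max_depth, depth + 1)


get_links_recursive("http://toscrape.com", "", set(["http://toscrape.com"]))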

Here is a BFS sketch:

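import requests
from bs4 import BeautifulSoup
from collections import deque

visited = set(["http://toscrape.com"])
dq = deque([["http://toscrape.com", "", 0]])
max_depth = 3

while dq:
    # popleft() takes the oldest entry (FIFO queue), which gives a BFS;
    # removing "left", i.e. calling pop(), takes the newest entry (stack) and gives a DFS
    base, path, depth = dq.popleft()

    if depth < max_depth:
        try:
            # Naive concatenation of base and path; relative links are not resolved properly
            soup = BeautifulSoup(requests.get(base + path, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue  # skip pages that fail to load

        for link in soup.find_all("a"):
            href = link.get("href")

            # Skip anchors without an href and links that were already seen
            if href and href not in visited:
                visited.add(href)
                print("  " * depth + f"at depth {depth}: {href}")

                if href.startswith("http"):
                    dq.append([href, "", depth + 1])
                else:
                    dq.append([base, href, depth + 1])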
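
The "first cap links up to max_depth pages deep" idea from earlier could be combined with the queue/stack switch roughly as follows. This is only a sketch under assumptions: crawl, get_links, max_links and breadth_first are hypothetical names, not part of the code above, and get_links stands in for whatever function fetches a page and returns its absolute links. A toy in-memory site replaces real HTTP requests so the traversal order is easy to inspect.

```python
from collections import deque


def crawl(start, get_links, max_links=100, max_depth=3, breadth_first=True):
    """Collect up to max_links URLs reachable from start within max_depth hops.

    get_links is a hypothetical callable: URL -> iterable of absolute URLs.
    The returned list records links in the order they were discovered.
    """
    visited = {start}
    found = []
    frontier = deque([(start, 0)])

    while frontier and len(found) < max_links:
        # popleft() = FIFO queue (BFS); pop() = LIFO stack (DFS)
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link in visited:
                continue
            visited.add(link)
            found.append(link)
            if len(found) >= max_links:
                break  # link cap reached
            frontier.append((link, depth + 1))
    return found


# Toy in-memory "site" standing in for real HTTP requests (assumption for the demo).
site = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}


def toy_links(url):
    return site.get(url, [])


bfs_order = crawl("A", toy_links)
dfs_order = crawl("A", toy_links, breadth_first=False)
print(bfs_order)
print(dfs_order)
```

With the stack variant, the crawler expands the most recently discovered page first, so page C's links are found before page B's, whereas the queue variant finishes each level before moving to the next.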

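These are pretty minimal sketches. Error handling and pruning of hrefs are only barely dealt with. There is a mixture of relative and absolute links, some of which have leading and/or trailing slashes. I'll leave manipulating these as an exercise for the reader.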
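
As one possible starting point for that exercise, the standard library's urllib.parse can resolve relative links against the page they were found on. The helper names normalize_link and same_site below are hypothetical, and stripping the trailing slash is a simplifying assumption (some servers treat /a and /a/ as different resources), so treat this as a sketch rather than a complete solution.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_link(page_url, href):
    """Return an absolute http(s) URL for href found on page_url, or None."""
    if not href:
        return None
    # Resolve relative paths ("/tag/love/", "../2/") against the current page
    absolute = urljoin(page_url, href.strip())
    # Drop "#section" fragments so the same page is not visited twice
    absolute, _fragment = urldefrag(absolute)
    # Skip non-web schemes such as mailto: or javascript:
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    # Assumption: /a and /a/ refer to the same page
    return absolute.rstrip("/")


def same_site(url, root):
    """True if url is on the same host as root."""
    return urlparse(url).netloc == urlparse(root).netloc


page = "http://quotes.toscrape.com/page/1/"
for raw in ["/tag/love/", "../2/", "http://books.toscrape.com", "mailto:a@b.c", "#top", None]:
    link = normalize_link(page, raw)
    print(raw, "->", link, "| same site:", link is not None and same_site(link, page))
```

Storing the normalized URL in the visited set, and only following links for which same_site is true, would keep the crawl on the quotes.toscrape site as suggested above.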