Retry mechanism for my Python web crawler script


So I am trying to make a website crawler that retrieves all the links inside a website and prints them to the console, while also writing the links to a text file, using a Python script.

The script takes the URL of the website you want to retrieve links from, together with the maximum number of URLs to crawl starting from the main page, and then retrieves the URLs using the functions crawl(), is_valid() and get_all_website_links(). It also separates external links from internal links inside the get_all_website_links() function.

So far I have been able to retrieve, print and write the links to a text file with the script, but I run into a problem when the server refuses a connection: the link retrieval stops and the execution ends.

What I want my script to do is retry a specified number of times and, if it still fails after retrying, continue to the next link.

I tried to implement this mechanism myself, but I could not come up with anything.
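Roughly, what I have in mind is a small wrapper along these lines (fetch_with_retries, the retry count of 3 and the 2-second delay are only placeholders to illustrate the idea, not code that exists in my script yet):

import time
import requests

def fetch_with_retries(url, retries=3, delay=2):
    """Try to GET `url` up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat HTTP error codes as failures too
            return response
        except requests.exceptions.RequestException as error:
            print(f"[!] Attempt {attempt}/{retries} failed for {url}: {error}")
            time.sleep(delay)
    return None  # signal the caller to skip this URL

get_all_website_links() would then call a wrapper like this instead of requests.get() directly, and simply return an empty set when it gets None back, so the crawl moves on to the next link.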

To give you a better picture, I have attached my Python script below.

A detailed explanation and implementation would be greatly appreciated.

Please excuse my grammar if it is bad ;)

Thank you for your time :)


Does this answer your question?

@Maurice Meyer The problem is that I am using BeautifulSoup4 to retrieve the URLs, but the answers in the post you referred me to require a different library. Anyway, thanks for your time, I figured this puzzle out myself! Thank you for your cooperation!
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys


# crawl() below calls itself recursively for every link it finds, so raise Python's default recursion limit
sys.setrecursionlimit(99999999)


print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)


siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")


# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET


# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()


def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
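    # NOTE: this bare requests.get() call is where a refused connection currently
    # aborts the entire crawl; a retry wrapper like the sketch above would go here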
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA} [!] External link: {href}{RESET}")
                with open(filename+".txt","a") as f:
                    print(f"{href}",file = f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename+".txt","a") as f:
            print(f"{href}",file = f)
        urls.add(href)
        internal_urls.add(href)
    return urls



# number of urls visited so far will be stored here
total_urls_visited = 0

def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 50000.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)



        


if __name__ == "__main__":
    crawl(siteurl,max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")