Retry mechanism for my Python web crawler script
So I am trying to make a website crawler that retrieves all the links within a website and prints them to the console, while also redirecting those links to a text file. The script takes the URL of the website you want to retrieve links from, the number of URLs to follow from the main page, and the maximum number of URLs to retrieve, and then retrieves the URLs using the functions crawl(), is_valid(), and get_all_website_links(). It also separates external links from internal links through the get_all_website_links() function.

So far I have successfully used the script to retrieve links, print them, and redirect them to a text file, but I ran into a problem when the server refuses a connection: the script stops link retrieval and ends execution.
What I want my script to do is retry a specified number of times and, if a link still fails even after the retries, continue on to the next link.

I tried to implement this mechanism myself, but I could not come up with anything.

For your better understanding, I have attached my Python script below.

A detailed explanation and implementation would be greatly appreciated.

Please excuse my bad grammar ;)

Thanks for your time :)
Does this answer your question? @Maurice Meyer The problem is that I am using BeautifulSoup4 to retrieve the URLs, but the answer in the post you referred me to requires a different library. Anyway, thanks for your time, I solved the puzzle myself! Thank you for your cooperation!
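The retry behavior asked for above can be sketched as a small wrapper around requests.get() that retries a fixed number of times and gives up gracefully, so the crawl loop can skip a dead link instead of crashing. This is only a sketch: fetch_with_retries, max_retries, and delay are illustrative names, not part of the script below.

```python
import time

import requests


def fetch_with_retries(url, max_retries=3, delay=2):
    """Try to GET `url` up to `max_retries` times; return None if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx responses as failures too
            return response
        except requests.exceptions.RequestException as exc:
            print(f"[!] Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            time.sleep(delay)  # pause briefly before the next attempt
    return None
```

Inside get_all_website_links(), the requests.get(url) call could then become fetch_with_retries(url), with the function returning early (e.g. an empty set) when it gets back None.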
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)  # raise the recursion limit so deep recursive crawls don't hit RecursionError
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # empty href attribute
            continue
        # join the URL if it's relative (not an absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA}[!] External link: {href}{RESET}")
                with open(filename + ".txt", "a") as f:
                    print(f"{href}", file=f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename + ".txt", "a") as f:
            print(f"{href}", file=f)
        urls.add(href)
        internal_urls.add(href)
    return urls
# number of urls visited so far will be stored here
total_urls_visited = 0
def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 50000.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
if __name__ == "__main__":
    crawl(siteurl, max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")
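As an alternative to a hand-rolled retry loop, requests also has transport-level retry support: mounting an HTTPAdapter configured with urllib3's Retry onto a Session makes every session.get() retry automatically with exponential back-off. This is a sketch, not part of the original script; make_session and its parameter names are illustrative, and the listed status codes are just a common choice.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=1):
    """Build a Session whose HTTP(S) requests are retried automatically."""
    retry = Retry(
        total=retries,                               # retry attempts per request
        backoff_factor=backoff,                      # exponential back-off between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

With this in place, get_all_website_links() would call session.get(url) instead of requests.get(url); wrapping that call in try/except requests.exceptions.RequestException lets the crawl move on to the next link once the retries are exhausted.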