Retry mechanism for my Python web crawler script
So I am trying to make a website crawler that retrieves all the links within a website and prints them to the console, while also redirecting those links to a text file. The script takes the URL of the website you want to retrieve links from, the number of URLs to follow from the main page, and the maximum number of URLs to retrieve, and then retrieves the URLs using the functions crawl(), is_valid(), and get_all_website_links(). It also separates external links from internal links through the get_all_website_links() function.

So far I have successfully used the script to retrieve links, print them, and redirect them to a text file, but I ran into a problem when the server refuses a connection: the script stops link retrieval and ends execution.
What I want my script to do is retry a specified number of times and, if a link still fails even after the retries, continue on to the next link.

I tried to implement this mechanism myself, but I could not come up with anything.

For your better understanding, I have attached my Python script below.

A detailed explanation and implementation would be greatly appreciated.

Please excuse my bad grammar ;)

Thanks for your time :)
Does this answer your question? @Maurice Meyer The problem is that I am using BeautifulSoup4 to retrieve the URLs, but the answer in the post you referred me to requires a different library. Anyway, thanks for your time, I solved the puzzle myself! Thank you for your cooperation!
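The retry behavior asked for above can be sketched as a small wrapper around requests.get() that retries a fixed number of times and gives up gracefully, so the crawl loop can skip a dead link instead of crashing. This is only a sketch: fetch_with_retries, max_retries, and delay are illustrative names, not part of the script below.

```python
import time

import requests


def fetch_with_retries(url, max_retries=3, delay=2):
    """Try to GET `url` up to `max_retries` times; return None if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx responses as failures too
            return response
        except requests.exceptions.RequestException as exc:
            print(f"[!] Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            time.sleep(delay)  # pause briefly before the next attempt
    return None
```

Inside get_all_website_links(), the requests.get(url) call could then become fetch_with_retries(url), with the function returning early (e.g. an empty set) when it gets back None.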
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)  # raise the recursion limit so deep recursive crawls don't hit RecursionError
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # empty href attribute
            continue
        # join the URL if it's relative (not an absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA}[!] External link: {href}{RESET}")
                with open(filename + ".txt", "a") as f:
                    print(f"{href}", file=f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename + ".txt", "a") as f:
            print(f"{href}", file=f)
        urls.add(href)
        internal_urls.add(href)
    return urls
# number of urls visited so far will be stored here
total_urls_visited = 0
def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 50000.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
if __name__ == "__main__":
    crawl(siteurl, max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")
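As an alternative to a hand-rolled retry loop, requests also has transport-level retry support: mounting an HTTPAdapter configured with urllib3's Retry onto a Session makes every session.get() retry automatically with exponential back-off. This is a sketch, not part of the original script; make_session and its parameter names are illustrative, and the listed status codes are just a common choice.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=1):
    """Build a Session whose HTTP(S) requests are retried automatically."""
    retry = Retry(
        total=retries,                               # retry attempts per request
        backoff_factor=backoff,                      # exponential back-off between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

With this in place, get_all_website_links() would call session.get(url) instead of requests.get(url); wrapping that call in try/except requests.exceptions.RequestException lets the crawl move on to the next link once the retries are exhausted.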