Python 3.x web crawler attempt
I am trying to make a web crawler that collects all the links on a site, and I want it to keep running until it has collected and crawled every link, but it stops after a short time and I don't know why. Thanks in advance. Here is my code:
import requests
from bs4 import BeautifulSoup

queue = set()
crawled = set()
DOMAIN = 'https://www.ebay.com/'

def finder(url):
    global crawled, queue, DOMAIN
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
    queue.add(url)
    if url not in crawled:
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        links = soup.find_all('a', href=True)
        crawled.add(url)
        for link in links:
            try:
                new_url = DOMAIN + link.get('href')
                print('Queue: ' + str(len(queue)) + ' | ' + 'Crawled: ' + str(len(crawled)))
                print(new_url)
                queue.add(new_url)
            except:
                return ''
    for each in set(queue):
        try:
            finder(each)
        except:
            return ''
    return queue, crawled

my_finder = finder(DOMAIN)
This works fine for me:
import requests
from bs4 import BeautifulSoup

queue = set()
crawled = set()
DOMAIN = 'https://www.ebay.com/'

def finder(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
    queue.add(url)
    if url not in crawled:
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        links = soup.find_all('a', href=True)
        crawled.add(url)
        for link in links:
            try:
                new_url = link.get('href')
                print('Queue: ' + str(len(queue)) + ' | ' + 'Crawled: ' + str(len(crawled)))
                print(new_url)
                queue.add(new_url)
            except:
                return ''
    for each in set(queue):
        try:
            finder(each)
        except:
            return ''
    return queue, crawled

finder(DOMAIN)
Sample output (shortened for brevity):
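One more caveat about the recursive approach: calling finder from inside finder for every queued URL can exhaust Python's recursion limit on a large site, which is one plausible way a crawl "just stops". A sketch of the same traversal done iteratively with an explicit queue is below; the get_links callable is an assumption introduced here so the loop can be exercised without the network (in practice it would wrap the requests + BeautifulSoup fetch):

```python
from collections import deque

def crawl(start_url, get_links):
    """Breadth-first crawl using an explicit queue instead of recursion.

    `get_links(url)` is any callable that returns the hrefs found on a
    page; injecting it keeps the traversal logic testable offline.
    """
    queue = deque([start_url])
    crawled = set()
    while queue:
        url = queue.popleft()
        if url in crawled:
            continue  # already visited, skip
        crawled.add(url)
        for link in get_links(url):
            if link not in crawled:
                queue.append(link)
    return crawled

# Tiny in-memory "site" standing in for real pages.
site = {'/': ['/a', '/b'], '/a': ['/b', '/'], '/b': []}
print(sorted(crawl('/', lambda u: site.get(u, []))))  # → ['/', '/a', '/b']
```

Because the queue lives on the heap rather than the call stack, this version has no depth limit and also never revisits a page.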
Also, try not to use a bare except: an except clause should name one or more specific exception types to handle, otherwise it silently swallows every error, including bugs in your own code.
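A minimal sketch of what that advice looks like in this crawler's context, assuming you only want to survive network-level failures: catch requests.RequestException (the documented base class for all requests errors) rather than everything.

```python
import requests

def fetch(url):
    """Fetch a page, returning None on any request failure.

    Catching requests.RequestException instead of using a bare
    `except` means genuine bugs (NameError, KeyboardInterrupt, ...)
    still surface instead of being silently swallowed.
    """
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # treat HTTP 4xx/5xx as errors too
        return resp.text
    except requests.RequestException as exc:
        print('Request failed:', exc)
        return None

# A malformed URL raises requests.exceptions.MissingSchema, a
# subclass of RequestException, so fetch returns None.
print(fetch('not-a-valid-url'))  # → None
```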
Comments:
"What do you mean by 'it stops'? What exactly is the problem?"
"Thank you, I've just updated my description."
"I think it may have been a problem with my network; thanks for the tip as well! Do you know why almost none of the links work when I click them? They lead to a page that says 'We can't find this page'."
"That's because you are concatenating the base URL with the URL already found in new_url = DOMAIN + link.get('href'). I've updated the answer."
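The concatenation bug described in that last comment has a standard fix: urllib.parse.urljoin resolves relative hrefs against the base URL while leaving already-absolute URLs untouched, so you never get a doubled prefix. A short illustration (the /sch/ebayadvsearch path is just an example, not taken from the original output):

```python
from urllib.parse import urljoin

DOMAIN = 'https://www.ebay.com/'

# A relative href is resolved against the base URL...
print(urljoin(DOMAIN, '/sch/ebayadvsearch'))
# → https://www.ebay.com/sch/ebayadvsearch

# ...while an absolute href passes through unchanged, instead of
# becoming 'https://www.ebay.com/https://www.ebay.com/help'.
print(urljoin(DOMAIN, 'https://www.ebay.com/help'))
# → https://www.ebay.com/help
```

In the crawler, `new_url = urljoin(DOMAIN, link.get('href'))` would handle both kinds of href correctly.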