Python: scraping a web page with Beautiful Soup, opening every link and extracting information
I'm trying to open the page of every company listed on Stack Overflow's companies page and extract specific information (such as the full description). Is there a simple way to do this with Beautiful Soup? At the moment, I'm getting the links to the companies on the first page:
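import requests
from bs4 import BeautifulSoup

# Fetch the first page of the company listing
r = requests.get('https://stackoverflow.com/jobs/companies')
src = r.content
soup = BeautifulSoup(src, 'lxml')

# Collect the link inside each <h2> heading
urls = []
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    if a_tag is not None:  # skip headings that contain no link
        urls.append(a_tag.attrs['href'])
print(urls)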
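The hrefs collected this way are site-relative, so they need the site's base URL in front before they can be requested. A minimal sketch of that step using `urljoin` from the standard library; the helper name `to_absolute` and the sample paths are made up for illustration and stand in for the `urls` list built above.

```python
from urllib.parse import urljoin

BASE_URL = "https://stackoverflow.com"

def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the base URL; already-absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical hrefs, standing in for the scraped `urls` list
sample_hrefs = ["/jobs/companies/example-co", "https://example.com/about"]
absolute_urls = to_absolute(sample_hrefs)
print(absolute_urls)
```

Each resulting URL can then be passed to `requests.get` in a loop, one company at a time.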
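Alternatively, you could use Selenium: scroll through the first page, click the button for the second page, and pass the page source to Beautiful Soup each time. I think that should work.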
import requests
from bs4 import BeautifulSoup as bsoup

base_url = "https://stackoverflow.com"

for i in range(1, 6):  # listing pages 1 through 5
    site_source = requests.get(
        f"https://stackoverflow.com/jobs/companies?pg={i}"
    ).content
    soup = bsoup(site_source, "html.parser")
    company_list = soup.find("div", class_="company-list")
    company_block = company_list.find_all("div", class_="grid--cell fl1 text")
    for company in company_block:
        if company.find("a"):
            # Follow the company's link and parse its own page
            company_url = company.find("a").attrs["href"]
            company_source = requests.get(base_url + company_url).content
            company_soup = bsoup(company_source, "html.parser")
            company_info = company_soup.find("div", id="company-name-tagline")
            print("Name: ", company_info.find("h1").text)
            print("Info: ", company_info.find("p").text)
            print()
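I'm basically looping over pages 1 through 5, getting the link for each company, then going to the company's page and printing out its name and description.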
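The same flow can be split into two small parsing helpers, which makes it possible to check the selectors against saved HTML before hitting the site. This is only a sketch: the function names `company_links` and `company_info` and the sample HTML are hypothetical, and the class names and id are copied from the code above, so they will only match if the real pages use that markup.

```python
from bs4 import BeautifulSoup

def company_links(listing_html):
    """Return the href of the first link in each company block of a listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    links = []
    for block in soup.find_all("div", class_="grid--cell fl1 text"):
        a_tag = block.find("a")
        if a_tag is not None and a_tag.get("href"):
            links.append(a_tag["href"])
    return links

def company_info(company_html):
    """Return (name, tagline) from a company page, or None if the header block is missing."""
    soup = BeautifulSoup(company_html, "html.parser")
    header = soup.find("div", id="company-name-tagline")
    if header is None:
        return None
    name = header.find("h1")
    tagline = header.find("p")
    return (
        name.get_text(strip=True) if name else "",
        tagline.get_text(strip=True) if tagline else "",
    )

# Hypothetical HTML that imitates the structure assumed above
LISTING = """
<div class="company-list">
  <div class="grid--cell fl1 text"><h2><a href="/jobs/companies/example-co">Example Co</a></h2></div>
  <div class="grid--cell fl1 text"><h2>No link here</h2></div>
</div>
"""
COMPANY = """
<div id="company-name-tagline"><h1> Example Co </h1><p>We build examples.</p></div>
"""

links = company_links(LISTING)
info = company_info(COMPANY)
print(links)
print(info)
```

In the real loop you would pass `requests.get(url).text` to each helper instead of the sample strings. Returning `None` for a missing header lets the caller skip a company page rather than crash on it.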
My output:
Name: BigCommerce
Info: Think BIG
Name: Facebook
Info: Our mission is to give people the power to build community and bring the world closer together.
Name: trivago N.V.
Info: A diverse team of talents that make a blazing fast accommodation search powered by cutting-edge tech and entrepreneurial innovation.
Name: General Dynamics UK
Info: General Dynamics UK is one of the UK’s leading defence companies, and an important supplier to the UK Ministry of Defence (MoD).
Name: EDF
Info: EDF is leading the transition to a cleaner, low emission electric future, tackling climate change and helping Britain reach net zero.
Name: Radix DLT
Info: Delivering Scalable Trust.
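Thanks for your answer. What I meant is that I want to open each company's link, get the information I need, and then go back and do the same for the next company (I already know how to loop through the pages). Anyway, thank you! :)
You can check my updated answer. It does what you described: it opens each company's link to get the information.
Thank you very much for your help!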