
Python: scraping a page with Beautiful Soup, following every link and extracting information

Tags: python, web-scraping, beautifulsoup, python-requests

I'm trying to open the page for each company listed on Stack Overflow's companies section and pull specific details (such as the full description). Is there a simple way to do this with Beautiful Soup? At the moment I'm collecting the links to the companies on the first page:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://stackoverflow.com/jobs/companies')
src = r.content
soup = BeautifulSoup(src, 'lxml')
urls = []

for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    # Not every h2 necessarily contains a link, so guard against a missing <a>
    if a_tag is not None:
        urls.append(a_tag.attrs['href'])

print(urls)
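Note that the hrefs collected this way are relative paths, so they have to be joined with the site root before they can be requested. A minimal sketch using the standard library's `urllib.parse.urljoin` (the example paths here are illustrative, not actual scraped values):

```python
from urllib.parse import urljoin

base_url = "https://stackoverflow.com"

# Hypothetical relative paths of the kind found in the h2 > a tags
relative_urls = ["/jobs/companies/bigcommerce", "/jobs/companies/facebook"]

# urljoin resolves each relative path against the site root
absolute_urls = [urljoin(base_url, u) for u in relative_urls]
print(absolute_urls[0])
# https://stackoverflow.com/jobs/companies/bigcommerce
```

Using `urljoin` instead of plain string concatenation also handles the case where a scraped href happens to be absolute already.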

Alternatively, you could scrape the first page and then use selenium to move to the next one by clicking the next-page button, feeding the page source back in each time; that approach should work as well.

import requests
from bs4 import BeautifulSoup as bsoup

base_url = "https://stackoverflow.com"

# Loop over result pages 1 through 5
for i in range(1, 6):
    site_source = requests.get(
        f"{base_url}/jobs/companies?pg={i}"
    ).content
    soup = bsoup(site_source, "html.parser")
    company_list = soup.find("div", class_="company-list")
    company_block = company_list.find_all("div", class_="grid--cell fl1 text")
    for company in company_block:
        if company.find("a"):
            # Follow the (relative) link to the company's own page
            company_url = company.find("a").attrs["href"]
            company_source = requests.get(base_url + company_url).content
            company_soup = bsoup(company_source, "html.parser")
            company_info = company_soup.find("div", id="company-name-tagline")
            print("Name: ", company_info.find("h1").text)
            print("Info: ", company_info.find("p").text)
            print()
I'm basically looping through pages 1 to 5, grabbing each company's link, then following it and printing out the company's name and description.
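The paginated URLs in that loop can also be built with the standard library's `urllib.parse.urlencode`, which takes care of query-string encoding; a small sketch (the `pg` parameter matches the one used above):

```python
from urllib.parse import urlencode

base = "https://stackoverflow.com/jobs/companies"

# Build the URL for each page of results, pages 1 through 5
page_urls = [f"{base}?{urlencode({'pg': page})}" for page in range(1, 6)]
print(page_urls[0])   # https://stackoverflow.com/jobs/companies?pg=1
print(page_urls[-1])  # https://stackoverflow.com/jobs/companies?pg=5
```

This matters more once you add parameters with spaces or special characters, which `urlencode` escapes correctly.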

My output:

Name:  BigCommerce
Info:  Think BIG

Name:  Facebook
Info:  Our mission is to give people the power to build community and bring the world closer together.   

Name:  trivago N.V.
Info:  A diverse team of talents that make a blazing fast accommodation search powered by cutting-edge tech and entrepreneurial innovation. 

Name:  General Dynamics UK
Info:  General Dynamics UK is one of the UK’s leading defence companies, and an important supplier to the UK Ministry of Defence (MoD).   

Name:  EDF
Info:  EDF is leading the transition to a cleaner, low emission electric future, tackling climate change and helping Britain reach net zero.

Name:  Radix DLT
Info:  Delivering Scalable Trust.

Thanks for your answer. What I meant was that I want to open each company's link, grab the information I need, and then go back and do the same for the next company (I already know how to loop through the pages). Thanks anyway! :)

You can check my updated answer; it does exactly what you mention, opening each company's link to get the information.

Thank you so much for your help!