Python can't parse a second page with Beautiful Soup

python, beautifulsoup

I'm trying to work my way through a website with BeautifulSoup. I open the first page and find the links I want, but when I ask BeautifulSoup to open the next page, no HTML gets parsed; it only returns this:

<function scraper at 0x000001E3684D0E18>
I tried opening the second page in its own script and that works fine, so the problem is with parsing one page from within another.

I have about 2000 links I need to go through, so I created a function to loop over them. This is my script so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import lxml

# The first webpage I'm parsing
my_url = 'https://mars.nasa.gov/msl/multimedia/raw/'

#calls the urlopen function from the request module of the urllib module
# AKA opens up the connection and grabs the page
uClient = uReq(my_url)

#imports the entire webpage from html format into python. 
# If webpage has lots of data this can take a long time and take up a lot of
# space or crash
page_html = uClient.read()

#closes the client
uClient.close()


#parses the HTML using bs4
page_soup = soup(page_html, "lxml")

#finds the categories for the types of images on the site, category 1 is
# RHAZ
containers = page_soup.findAll("div", {"class": "image_list"})

RHAZ = containers[1]


# prints the links in RHAZ
links = []
for link in RHAZ.find_all('a'):
    #removes unwanted characters from the link, making it usable.
    formatted_link = my_url+str(link).replace('\n','').split('>')[0].replace('%5F\"','_').replace('amp;','').replace('<a href=\"./','')
    links.append(formatted_link)

print (links[1])
# I know I should be defining a function here.. so I'll give it a go.
def scraper():
    pic_page = uReq('links[1]') #calls the first link in the list
    page_open = uClient.read() #reads the page in a python accessible format
    uClient.close() #closes the page after it's been stored to memory
    soup_open = soup(page_open, "lxml")
    print (soup_open)
print (scraper)

Do I need to clear the previously loaded HTML out of BeautifulSoup so I can open the next page? If so, how would I do that? Thanks for any help you can provide.

You need to make a request to each URL scraped from the first page... check this code:

from bs4 import BeautifulSoup
import requests

url = 'https://mars.nasa.gov/msl/multimedia/raw'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')

# each image category on the page sits in its own 'image_list' div
img_list = soup.find_all('div', attrs={'class': 'image_list'})
for i in img_list:
    image = i.find_all('a')
    for x in image:
        # hrefs are relative ('./...'), so drop the dot and prepend
        # the base url to build an absolute link to the category page
        href = x['href'].replace('.', '')
        link = (str(url)+str(href))
        # request each linked page and parse it with a fresh soup object
        req2 = requests.get(link)
        soup2 = BeautifulSoup(req2.content, 'lxml')
        img_list2 = soup2.find_all('div', attrs={'class': 'RawImageUTC'})
        for l in img_list2:
            image2 = l.find_all('a')
            for y in image2:
                href2 = y['href']  # direct link to the raw .JPG image
                print(href2)
Output:

http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FRB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134317EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134106EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134065EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134052EDR_S0722464FHAZ00222M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590133948EDR_S0722464FHAZ00222M_.JPG
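
As a side note, the <function scraper at 0x000001E3684D0E18> output in the question is not a parsing problem at all: the last line, print (scraper), prints the function object itself because scraper is never called. A minimal illustration, independent of the scraping code:

def scraper():
    return 'parsed HTML would go here'

print(scraper)    # <function scraper at 0x...>  (the function object)
print(scraper())  # actually calls the function and prints its return value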

Can you be more specific about what exactly you're trying to do? It looks like you may be looking for the rear hazard avoidance camera images/links in particular?

The first part of the script collects all the links for the rear haz cam and converts them to URL format. Then I want to open each link and download the images on those pages (see the sketch below).

What do you mean? It will print all the URLs.
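
Since the goal described in the comments is to actually save the images rather than just print their URLs, here is a minimal sketch of a download step that could replace the print(href2) line in the answer above. It only assumes the requests library already used there; the download_image helper and the 'images' folder name are hypothetical choices, not part of the original answer.

import os
import requests

def download_image(img_url, dest_dir='images'):
    # hypothetical helper: fetch one image URL and save it to dest_dir
    os.makedirs(dest_dir, exist_ok=True)  # create the folder if it doesn't exist
    filename = img_url.rsplit('/', 1)[-1]  # e.g. FLB_590315340EDR_..._.JPG
    resp = requests.get(img_url)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of saving junk
    with open(os.path.join(dest_dir, filename), 'wb') as f:
        f.write(resp.content)

# usage inside the innermost loop of the answer, instead of print(href2):
# download_image(href2)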