Extracting the JPG from an image tag's src attribute with BeautifulSoup


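I am scraping this web page for personal use and am having trouble extracting the thumbnail for each item on the page. When I use "inspect" to view the HTML DOM, I can see the image tags that contain the .jpg I need, but when I use "view page source", the img tags do not show up. At first I thought this might be an asynchronous JavaScript loading issue, but a reliable source told me I should be able to scrape the thumbnails directly with BeautifulSoup.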

import lxml
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

r = requests.get("https://asheville.craigslist.org/search/fua", params=dict(postal=28804), headers={"user-agent":ua.chrome})
soup = BeautifulSoup(r.content, "lxml")
for post in soup.find_all('li', "result-row"):
    for post_content in post.findAll("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.findAll("img", {'alt class': 'thumb'}):
            print(pic['src'])

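Can anyone clear up my misunderstanding? The value of the "a" tag's href attribute is printed, but I can't seem to print the src attribute of the "img" tag. Thanks in advance.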

I was able to read the img tags with the following code:

for post in soup.find_all('li', "result-row"):
    for post_content in post.find_all("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.find_all("img"):
            print(pic['src'])
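
A likely reason the filter in the question matched nothing: a dict passed to find_all is keyed by attribute name, so {'alt class': 'thumb'} asks for an attribute literally called "alt class". The sketch below illustrates this on a small inline snippet. The markup and the example.org URLs are hypothetical, modelled on the selectors used in the question; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's selectors; not copied from the real page.
SAMPLE_HTML = """
<li class="result-row">
  <a class="result-image gallery" href="https://example.org/post/1.html">
    <img alt class="thumb" src="https://example.org/images/1_300x300.jpg">
  </a>
</li>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a", "result-image gallery")

# A dict filter is keyed by attribute name, so this looks for an attribute
# literally named "alt class" and matches nothing.
broken = link.find_all("img", {"alt class": "thumb"})

# Passing the class as the second positional argument matches on CSS class.
by_class = link.find_all("img", "thumb")

# A CSS selector can also restrict the result to src values ending in .jpg.
jpg_urls = [img["src"] for img in link.select("img.thumb[src$='.jpg']")]

print(len(broken), len(by_class), jpg_urls)
```

If the thumbnails do carry class="thumb", either of the last two forms should be a drop-in replacement for the unfiltered find_all("img") call.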

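Some thoughts on scraping from craigslist:

Limit your requests per second. I have heard that craigslist will place a temporary block on your IP address if you send requests too frequently.
Each post seems to have one or two images. On closer inspection, the carousel images are not loaded unless you click the arrows. If you need every picture from every post, you should script it a different way, possibly by visiting the link of each post that has multiple images.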
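
One way to limit the request rate is a small throttle that sleeps until a minimum interval has passed since the previous request. This is only a sketch: the Throttle class and its parameters are hypothetical names introduced here, and the interval to use is an assumption, since the threshold at which craigslist blocks an IP is not stated above.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval is in seconds; clock and sleep are injectable for testing.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last call: sleep for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

To use it, create one instance, for example throttle = Throttle(2.0), and call throttle.wait() immediately before each requests.get or driver.get call. The 2-second figure is an arbitrary starting point, not a documented limit.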

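Also, I think selenium is great for web scraping. This project may not need it, but it lets you do more, such as clicking buttons and entering form data. Here is a quick script where I used Selenium to scrape the data: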

from bs4 import BeautifulSoup
from selenium import webdriver

def test():
    url = "https://asheville.craigslist.org/search/fua"
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # page_source is the HTML as the browser currently holds it
        html = driver.page_source
    finally:
        # always close the browser, even if loading the page fails
        driver.quit()
    soup = BeautifulSoup(html, "lxml")
    for post in soup.find_all('li', "result-row"):
        # each thumbnail link wraps the img tag(s) for that post
        for post_content in post.find_all("a", "result-image gallery"):
            print(post_content['href'])
            for pic in post_content.find_all("img"):
                print(pic['src'])

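Thank you very much for the clarification and advice! I just started learning selenium today and will definitely keep using it for web scraping tasks. Cheers

@基南伯克·皮特 Awesome! Happy scraping!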