Python regex or another way to get only the image URLs

python regex web-scraping beautifulsoup

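I want to download the images from the following page. I fetched it with urllib and parsed it with BeautifulSoup. It contains many URLs, and I only want the ones that end in .jpg and also carry the rel="prettyPhoto[gallery]" attribute. How can I do this with BeautifulSoup? Example link: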

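Your code has a lot of unnecessary stuff. Maybe you will use it later, but things like assigning 2 to count and then using it as the counter in a for range loop are pointless. Here is code that does what you need: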

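# Python 2 code (urllib2 does not exist in Python 3)
import urllib2
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

# range(1, 2) visits only page 1; widen the range to crawl more pages
for count in range(1, 2):
    url = baseurl + str(count) + "/"
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    # Keep only tags with rel="prettyPhoto[gallery]" that also have an href
    # (findAll is the legacy name; bs4 also offers find_all)
    atag = soup.findAll(rel="prettyPhoto[gallery]", href=True)
    for tag in atag:
        if tag['href'].endswith(".jpg"):
            imgurl = tag['href']
            # Assumes the hrefs are site-relative, so the host is prepended
            img = urllib2.urlopen("http://wordpandit.com" + imgurl)
            # Save under the last path component of the URL
            with open(imgurl.split("/")[-1], "wb") as local_file:
                local_file.write(img.read())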
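
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched roughly as below. This is an adaptation under assumptions, not the answerer's code: the helper names extract_jpg_links and download_images and the SAMPLE_HTML snippet are made up for illustration, urllib.request stands in for urllib2, urljoin replaces the hard-coded host prefix, and the parser is named explicitly as html.parser.

```python
import os
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = "http://wordpandit.com/learning-bin/visual-vocabulary/page/"

# Made-up markup, used only to try the filter without touching the network.
SAMPLE_HTML = (
    '<a rel="prettyPhoto[gallery]" href="/images/alpha.jpg">keep</a>'
    '<a rel="prettyPhoto[gallery]" href="/images/beta.png">wrong extension</a>'
    '<a href="/images/gamma.jpg">no rel attribute</a>'
    '<a rel="prettyPhoto[gallery]">no href attribute</a>'
)


def extract_jpg_links(html, page_url):
    """Return absolute URLs of .jpg links that carry rel="prettyPhoto[gallery]"."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips matching tags that have no href attribute at all.
    for tag in soup.find_all(rel="prettyPhoto[gallery]", href=True):
        href = tag["href"]
        if href.endswith(".jpg"):
            # urljoin copes with both relative and already-absolute hrefs.
            links.append(urljoin(page_url, href))
    return links


def download_images(first_page=1, last_page=1):
    """Hypothetical driver: fetch each listing page and save its images."""
    for count in range(first_page, last_page + 1):
        page_url = BASE_URL + str(count) + "/"
        with urlopen(page_url) as response:
            html = response.read()
        for img_url in extract_jpg_links(html, page_url):
            filename = os.path.basename(img_url)
            with urlopen(img_url) as img, open(filename, "wb") as local_file:
                local_file.write(img.read())


# Offline check of the filter; download_images() is deliberately not called here.
print(extract_jpg_links(SAMPLE_HTML, BASE_URL))
```

Keeping the filtering in its own function means it can be checked against a saved page or a small snippet before any download is attempted.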

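Comments:

You need a .read() on html_page, and it would be better to add href=True to the findAll (note: if you are using bs4, it will be/should be find_all).

@JonClements, why do I need read() on html_page? Is it faster or more accurate? (It works without it.)

Because until you read it there is no data; it is just a connection object... Try printing your soup and you will find it is nothing more than str(html_page).

@JonClements, if I print soup it gives me all the HTML code. I am using Python 2.7.5, urllib2 2.7 and BeautifulSoup 4.3.1 (on both Linux and Windows).

OK, I take that back... In the "old" days you had to give soup a string... It looks like it handles that now :) Sorry ;)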
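
The point settled in the comments, that bs4 accepts an open file-like object as well as a string, can be checked without a network connection. The sketch below is an illustration under assumptions: the MARKUP snippet is made up, and io.BytesIO stands in for the response object returned by urlopen, since both expose a read() method.

```python
import io

from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
MARKUP = '<a rel="prettyPhoto[gallery]" href="/images/word.jpg">word</a>'

# Case 1: pass the markup as a string (what calling .read() first gives you).
soup_from_string = BeautifulSoup(MARKUP, "html.parser")

# Case 2: pass an unread file-like object; bs4 calls read() on it itself.
file_like = io.BytesIO(MARKUP.encode("utf-8"))
soup_from_file = BeautifulSoup(file_like, "html.parser")

href_from_string = soup_from_string.a["href"]
href_from_file = soup_from_file.a["href"]

# Both routes should parse to the same link.
print(href_from_string == href_from_file)
```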