Python: downloading webcomics and saving blank files


python web-scraping


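I have a script that downloads the Questionable Content webcomic. It appears to run fine, but the files it downloads are empty, only a few KB in size.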

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
RegExp = re.compile('.*<img src="http://www.questionablecontent.net/comics.*')

#Check the main QC page
site = urllib.urlopen("http://questionablecontent.net/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    contentLine = contentLine[4].split('.')
    EndNum = int(contentLine[0])
else:
    EndNum = 2622

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 2622

#Full path of destination folder needs to pre-exist
destinationFolder = r"D:\Downloads\Comics\Questionable Content"

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    source = "http://www.questionablecontent.net/comics/"+str(i)+".png"

    #Save image from Questionable Content to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"

Why does it always download empty files? Is there a workaround?
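
One way to narrow this down, as a minimal sketch: check whether the saved files actually begin with a PNG or GIF file signature. If they do not, the server probably sent something other than the image, for example an HTML error page. The helper names `looks_like_image` and `check_file` are hypothetical, and the sketch assumes Python 3, unlike the Python 2 script above.

```python
# File signatures ("magic bytes") that PNG and GIF files start with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def looks_like_image(data):
    """Return True if the bytes start with a PNG or GIF signature."""
    return data.startswith(PNG_SIGNATURE) or data.startswith(GIF_SIGNATURES)


def check_file(path):
    """Read the first bytes of a saved file and test them (hypothetical helper)."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    return looks_like_image(head)
```

Running `check_file` on one of the small downloaded files, and opening that file in a text editor if the check fails, should show what the server returned instead of the comic.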

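The problem here is that the server will not serve you the image unless a user agent is set. Here is sample code for Python 2.7 that should give you an idea of how to make your script work: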

import urllib2
import time

first = 1
last = 2622

for i in range(first, last+1):
    time.sleep(5) # Be nice to the server! And avoid being blocked.
    for ext in ['png', 'gif']:
        try:
            req = urllib2.Request('http://www.questionablecontent.net/comics/{}.{}'.format(i, ext))
            req.add_header('user-agent', 'Mozilla/5.0')
            data = urllib2.urlopen(req).read()
        except urllib2.HTTPError:
            # No image with this extension; try the next one.
            continue
        # Make sure that the img dir exists! If not, the script will throw an
        # IOError. The file is opened only after a successful fetch, so a
        # failed extension does not leave an empty file behind.
        with open('img/{}.{}'.format(i, ext), 'wb') as ifile:
            ifile.write(data)
        break
    else:
        print 'Could not find image {}'.format(i)
        continue
    print 'Downloaded image {}'.format(i)

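You will probably want to change the loop to something closer to your own, checking whether an image has already been downloaded and so on. As written, this script tries to download every comic from number first to number last (1 to 2622 in the code), trying the png extension first and then gif.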
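
As a rough sketch of that adaptation, the following combines the skip-if-already-downloaded check from the question with the user-agent header and the png/gif fallback from the answer. It is written for Python 3 (an assumption; the code above targets Python 2.7), and the names `fetch`, `download_comic`, `BASE_URL`, and the returned status strings are made up for illustration.

```python
import os
import urllib.error
import urllib.request

# URL pattern taken from the scripts above; extensions are tried in this order.
BASE_URL = "http://www.questionablecontent.net/comics/{}.{}"
EXTENSIONS = ("png", "gif")


def fetch(url):
    """Fetch a URL with a browser-like User-Agent header and return the body."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        return response.read()


def download_comic(number, dest_dir, fetch_func=fetch):
    """Download one comic unless it is already on disk.

    Returns "skipped", "downloaded", or "missing". dest_dir must already exist.
    """
    # Skip the comic if a file with either extension is already present.
    for ext in EXTENSIONS:
        if os.path.exists(os.path.join(dest_dir, "{}.{}".format(number, ext))):
            return "skipped"
    for ext in EXTENSIONS:
        try:
            data = fetch_func(BASE_URL.format(number, ext))
        except urllib.error.HTTPError:
            continue  # no image with this extension; try the next one
        # Write only after a successful fetch so no empty file is left behind.
        with open(os.path.join(dest_dir, "{}.{}".format(number, ext)), "wb") as out:
            out.write(data)
        return "downloaded"
    return "missing"
```

Calling `download_comic(i, "img")` for each `i` in the desired range, with a `time.sleep(5)` after each real download as in the answer, would reproduce the original loop while leaving existing files alone. The `fetch_func` parameter exists only so the function can be tried out with a stand-in fetcher instead of hitting the server.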

Thanks, I gave that a try and it seems to work well. Thanks again!