Python - How to search for a zip file that resides in an iframe on an https site (Python 2.7.5, Google Chrome)

python iframe web https

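First off, I am a self-taught coder and will accept any critique and/or suggestions about the code I post below. This problem has been a joy to work through because I like challenging myself, but I am afraid I have hit a brick wall and need some guidance. Below I explain the overall picture of my script in as much detail as I can, and then show where I stand on the actual issue described in the title.

I am putting together a script that will automatically download data, upload it and export it to a GDB. We serve a wide variety of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search for and update for our end users. Most of our data is updated monthly by local government entities, and we have to go out and manually search for the data, download it, unzip it, QAQC it, and so on. I want to put together a script that automates the first part of this process by going out, downloading all the data for me and exporting it to a local GDB. From there I can QAQC everything and upload it to our SDE for our users to access.

The process has been fairly straightforward so far, until I got to the problem now in front of me. My script searches a web page for specific keywords, finds the relevant links and starts the downloads. For this post I will use two examples: one that works and one that is currently giving me problems. The one that works is my function for searching and downloading the Metro GIS dataset, and my current process for finding that dataset is shown below. So far, every http site I have included uses the function posted below. As with Metro, I plan to define a function for each set of data.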

import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True

workPath = -- #The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"

class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):       
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        page = urllib2.urlopen(url).read()
        urlList = []

        soup = BeautifulSoup(page)
        soup.prettify()

        for link in soup.findAll('a', href = True):
            if not 'http://' in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)        

        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))        
            z.extractall(dlPath)        

        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []

        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)

        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))

        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
        dlPath = -- *#Where my zipfiles output to*  
        dsName = "Metro"

        self.fileDownload(key, url, dlPath, dsName)

global_DataFinder()
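The link-filtering step inside fileDownload can also be looked at in isolation. The following is only a minimal sketch of that step, assuming the page HTML has already been fetched. The names `find_zip_links` and `_HrefCollector` are hypothetical, and the sketch swaps BeautifulSoup for the standard-library HTML parser so that it has no third-party dependencies. Unlike the loop above, it also keeps absolute links rather than skipping any href that contains `http://`.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:  # Python 2.7, as used in the question
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_zip_links(page_html, base_url, key):
    """Return absolute .zip URLs from the page whose URL contains `key`."""
    collector = _HrefCollector()
    collector.feed(page_html)
    found = []
    for href in collector.hrefs:
        # urljoin leaves absolute links alone and resolves relative ones.
        absolute = urljoin(base_url, href)
        is_zip = absolute.lower().endswith(".zip")
        if is_zip and key in absolute and absolute not in found:
            found.append(absolute)
    return found
```

Keeping the filter as a pure function of the HTML string makes it possible to test the keyword and extension logic without touching the network or arcpy.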

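As you can see above, this is the approach I started with, using Metro as my first test point, and it currently works great. I had hoped all my sites would work like this, but when I got to FEMA I ran into a problem.

The website hosts floodplain data for many counties across the country, free to anyone who wants to use it. When you arrive at the site you can search for the county you want, the table then queries out the search results, and you simply click and download the county you want. When checking the source code, this is what I came across, and I noticed that it sits in an iframe.

When visiting the iframe source link through Chrome and inspecting the png source url, this is what you get: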

import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content) #verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')

    for a in table.find('a'):
        href = a.attrib.get('href')
        print href
        url = '/'.join([download_prefix, href])
        print url
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16): # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])


out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
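The iframe lookup in the script above depends on pyquery. As a rough illustration of that single step, here is a sketch that uses only the standard library and also resolves a relative `src` against the page URL. The names `first_iframe_url` and `_IframeFinder` are hypothetical, and the sketch assumes the page HTML has already been downloaded as a string.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:  # Python 2.7, as used in the question
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class _IframeFinder(HTMLParser):
    """Records the src attribute of each <iframe> tag in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_iframe_url(page_html, page_url):
    """Return the absolute URL of the first iframe, or None if there is none."""
    finder = _IframeFinder()
    finder.feed(page_html)
    if not finder.sources:
        return None
    # A relative src is resolved against the URL of the page that embeds it.
    return urljoin(page_url, finder.sources[0])
```

Returning None when no iframe is present lets the caller stop early instead of passing an empty value on to the next request.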

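This is where my problem lies. Unlike the http sites, I quickly learned that accessing a secured https site and scraping the page is different, especially when it uses javascript to display the table. I have spent hours searching through forums and trying different Python packages such as selenium, mechanize, requests, urllib and urllib2, and I always seem to hit a dead end before I can establish a secure connection, parse the web page and search for my zip files. The code below shows the closest I have gotten, along with the error I am receiving.

(I always test in a separate script and bring it into my main script once it works, which is why the snippet below is separate from my original code.)

The error I receive when running this test:

urllib2.URLError: <urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed>

I am hoping a more experienced coder can look at what I have done and tell me whether my current approach is workable and, if so, how to get past this final error and parse the data table properly.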

EDITS WITH @crmackey

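I only added httplib and changed the HTTPConnection settings at the top. That allowed me to connect to the site using your script. Now for the current problem: I only get 1 zip file in my out_path, and that zip file is empty. I checked the printed source in the debug window, and it shows the script trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table, so it looks like it is trying, but nothing is being downloaded. After outputting the one empty zip file, the script finishes and shows no further error messages. I have temporarily removed your lines that unzip the file, because they returned an error since the folder was empty.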
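One way to keep an empty or broken download from raising an error in the unzip step is to check the file before extracting it. The following is only a sketch of such a guard, and `extract_if_valid` is a hypothetical helper name. It does not explain why the download came back empty. It only reports the problem as a return value instead of an exception.

```python
import os
import zipfile


def extract_if_valid(zip_path, dest_dir):
    """Extract zip_path into dest_dir.

    Returns True on success and False when the file is missing, empty,
    or not a zip archive (for example an HTML error page saved as .zip).
    """
    if not os.path.isfile(zip_path) or os.path.getsize(zip_path) == 0:
        return False
    if not zipfile.is_zipfile(zip_path):
        return False
    archive = zipfile.ZipFile(zip_path, "r")
    try:
        archive.extractall(dest_dir)
    finally:
        archive.close()
    return True
```

A caller could log every path for which this returns False and retry or inspect those downloads afterwards.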
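
I was able to download the zip files using the requests module, and opted to use PyQuery instead of Beautiful Soup. I think the issue you are facing has to do with SSL certificate validation; the requests module lets you skip checking the certificate if you set the verify parameter to False.

The function below will download all the zip files and unzip them. From there, you can import the shapefiles into your geodatabase: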

import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content) # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')

    for a in table.find('a'):
        href = a.attrib.get('href')
        url = '/'.join([download_prefix, href])
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16): # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

        # do more stuff like unzip?
        unzipped = out_zip.split('.zip')[0]
        with zipfile.ZipFile(out_zip, 'r') as f:
            f.extractall(unzipped)
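The URL and file-name handling in the loop above can be pulled out into a small helper, which makes it easier to see what each link turns into before any request is sent. This is a sketch only: `build_download_target` is a hypothetical name, and the href shape used in the usage note is an assumption for illustration, not the actual format of the FEMA links.

```python
import os


def build_download_target(download_prefix, href, out_path):
    """Return (download_url, local_zip_path) for a table link.

    Returns None when the text after the last '=' in href does not look
    like a zip file name, so non-download links can be skipped.
    """
    url = "/".join([download_prefix.rstrip("/"), href.lstrip("/")])
    # Same rule as the answer: the file name is whatever follows the last '='.
    file_name = os.path.basename(href.split("=")[-1])
    if not file_name.lower().endswith(".zip"):
        return None
    return url, os.path.join(out_path, file_name)
```

For example, with an assumed href such as `Download/Servlet?fileName=sample.zip`, the helper returns the prefixed URL and a path ending in `sample.zip` under out_path.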

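I saw this on GIS Stack Exchange, but didn't get an answer added before the question was put on hold. Please see my answer below.

Thank you for taking the time to put that together, @crmackey. I copied your script into a py file, added an out_path variable and tested it, and I got this error: requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 5122 more expected)', IncompleteRead(0 bytes read, 5122 more expected)). Do you think this is related to the connection at my office? The error occurs at pq = PyQuery(requests.get(src, verify=False).content)

That may be; strange. Did it download any of the zip files, or did it fail on the first one?

I made one change to the code to try to get past that error, and it seemed to work, but now I have another strange thing happening. I will make an edit to my original post now to show it.

The new edit in the original post should now be visible and is marked with your name.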