Python: simple web scraper is very slow

python web-scraping

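I'm fairly new to Python and to web scraping in general. The code below works, but it seems very slow for the amount of information it actually transfers. Is there an easy way to cut the execution time? I'm not sure, but I suspect I've written more than I need and made it harder than it has to be. Any help would be appreciated.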

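Currently the code starts at a sitemap index and iterates over the list of sitemaps it points to. From each of those sitemaps it extracts the package ID and builds the URL for that page's JSON data. From the JSON I pull a link to an XML file, which I search for a string. If the string is found, the matching entry is appended to a text file.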
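The two URL-building steps described above can be sketched in isolation, without any network access. This is only an illustration: the helper names `detail_url` and `mods_url` are hypothetical, the sample values in the usage lines are made up for the example, and the `ValueError` check is an addition that the code in this question does not have.

```python
DETAIL_PREFIX = "https://www.govinfo.gov/wssearch/getContentDetail?packageId="


def detail_url(loc_text):
    """Build the JSON detail URL from the text of a sitemap <loc> element."""
    start = loc_text.find("PLAW")
    if start == -1:
        # Assumption: fail loudly instead of slicing from index -1
        raise ValueError("no PLAW package id in: " + loc_text)
    return DETAIL_PREFIX + loc_text[start:]


def mods_url(modslink):
    """Turn a protocol-relative 'modslink' value ('//host/path') into an https URL."""
    return "https://" + modslink[2:]


if __name__ == "__main__":
    # Illustrative inputs only, not taken from the real sitemap
    print(detail_url("https://www.govinfo.gov/app/details/PLAW-115publ1"))
    print(mods_url("//example.org/pkg/mods.xml"))
```

Keeping these steps as pure string functions makes them easy to check separately from the slow part, which is the HTTP requests.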

import io

import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap="https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"

old_xml=requests.get(urlSitemap)
print (old_xml)
new_xml= io.BytesIO(old_xml.content).read()
final_xml=BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap=loc.text
    old_xmlPLmap=requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap= io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap=BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork =eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2='https://' + finaldict
        try:    
            old_xml4=requests.get(url2)
            print(old_xml4)
            new_xml4= io.BytesIO(old_xml4.content).read()
            final_xml4=BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier',{'type': 'Statute citation'})
            for sec in references: 
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt','a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)

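Not sure why I spent so long on this, but I did. Your code was really hard to read, so I started there: I split it into two parts, getting the links from the sitemaps and then everything else, and pulled a few pieces out into separate functions. It checks about 2 URLs per second on my machine, which seems about right. Here is what changed, and you can argue with me on this part: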

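- The output file no longer has to be reopened and closed after every write.
- A fair amount of unneeded code is gone.
- The variables have better names. This does not improve speed in any way, but please do it, especially when you are asking for help.
- The part that really matters: once everything is broken up, it becomes clear that what slows you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait. Once you get into multithreading, the benefit of breaking up the code will probably become more evident too.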

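Thanks. This didn't shave a ton of time off the scrape, but the code is actually workable now. I hope to get better.

Ha, speaking up is how you get better. I'm still at that stage myself and often have to rewrite or restructure my own code. Good luck, and if speed is a major consideration, look into multithreading. You can get really good performance improvements, especially with web scraping.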

import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        # no matching citation in this document
        return False
    except Exception:
        print("bah " + r.url)
        return False


sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]



with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
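The multithreading suggestion above could look roughly like the sketch below, which uses the standard library's `concurrent.futures.ThreadPoolExecutor` to overlap the time spent waiting on requests. The helper name `check_links`, the default of 8 workers, and the stand-in `fake_worker` are assumptions made for illustration; they are not part of the answer's code.

```python
from concurrent.futures import ThreadPoolExecutor


def check_links(links, worker, max_workers=8):
    """Call worker(link) for every link on a thread pool.

    Returns the truthy results, in the same order as the input links.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order while the blocking calls overlap
        results = list(pool.map(worker, links))
    return [r for r in results if r]


if __name__ == "__main__":
    # Stand-in worker so the sketch runs without network access;
    # in the scraper above this role would be played by scrapey.
    def fake_worker(link):
        return link if link.endswith("publ1") else False

    demo_links = ["PLAW-115publ1", "PLAW-115publ2", "PLAW-116publ1"]
    for url in check_links(demo_links, fake_worker, max_workers=2):
        print("found on: {}".format(url))
```

In the script above, something like `check_links(links, scrapey)` would take the place of the `for link in links` loop, with the file writes kept in the main thread so only one thread touches `output.txt`. The worker count is a guess; a smaller pool is kinder to the server.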