Python: scraping a website with multiple links and no "Next" button, using Beautiful Soup

Tags: python, pandas, web-scraping, beautifulsoup

I am very new to Python (three days in) and I've run into a problem I can't solve with Google/YouTube. I want to scrape background data on all US governors and save it to a CSV file.

I have managed to scrape a list of all the governors, but to get more details I need to visit each governor's page individually and save the data from there. The code suggestions I found online loop over multiple pages using a "Next" button or a predictable URL structure. This site, however, has no "Next" button, and the links to the individual pages don't follow a pattern I can loop over, so I'm stuck.

Any help would be much appreciated. I want to extract the information shown above the main text on each governor's page (office dates, schooling, etc., which sit in "address" tags).

Here's what I have so far:

import bs4 as bs
import urllib.request
import pandas as pd

url = 'https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=10&endcac77e09-db17-41cb-9de0-687b843338d0=9999&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&militaryService=&higherOfficesServed=&religion=&lastName=&sex=Any&honors=&submit=Search&college=&firstName=&party=&inOffice=Any&biography=&warsServed=&'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, "html.parser")

#dl list of all govs
dfs = pd.read_html(url, header=0)
pd.concat(dfs).to_csv('governors.csv')  # combine all result tables into one csv

#dl links to each gov
table = soup.find('table', 'table table-striped table-striped')
links = table.findAll('a')
with open('governors_links.csv', 'w') as r:
    for link in links:
        r.write(link['href'])
        r.write('\n')

#enter each gov page and extract data in the "address" tag(s)
#save this in a csv file

I'm assuming you have the links stored in a list called links.

You can then fetch the data for all governors one by one like this:

for link in links:
    r = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(r, 'html.parser')
    print(soup.find('h2').text)  # Name of Governor
    for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
        print(p.text.strip())  # Office dates, address, phone, ...
    for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
        print(p.text.strip())  # Family, school, birth state, ...
EDIT:

Change your links list to

links = ['https://www.nga.org' + x.get('href') for x in table.findAll('a')]
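
If it helps, here is a minimal sketch of how the two snippets above could be combined to store the scraped fields instead of printing them. It is untested and assumes the same col-md-3 / col-md-7 markup and the links list built above; the csv module quotes fields for you, so commas inside addresses will not break the file:

import csv
import bs4 as bs
import urllib.request

# `links` is assumed to be the list of absolute URLs built above
with open('governors_details.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for link in links:
        page = urllib.request.urlopen(link).read()
        soup = bs.BeautifulSoup(page, 'html.parser')
        row = [soup.find('h2').text.strip()]  # governor's name
        for div_class in ('col-md-3', 'col-md-7'):
            for p in soup.find('div', {'class': div_class}).findAll('p'):
                row.append(p.text.strip())  # office dates, address, family, school, ...
        writer.writerow(row)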

This might work. I haven't tested it fully to completion because I'm at work, but it should be a starting point for you:

import bs4 as bs
import requests
import re
# helper: returns True if the string parses as an int
def is_number(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def main():
    url = 'https://www.nga.org/cms/FormerGovBios?inOffice=Any&state=Any&party=&lastName=&firstName=&nbrterms=Any&biography=&sex=Any&religion=&race=Any&college=&higherOfficesServed=&militaryService=&warsServed=&honors=&birthState=Any&submit=Search'

    sauce = requests.get(url).text
    soup = bs.BeautifulSoup(sauce, "html.parser")
    finished = False
    csv_data = open('Govs.csv', 'a')
    csv_data.write('Name,Address,OfficeDates,Success,Address,Phone,Fax,Born,BirthState,Party,Schooling,Email\n')
    try:
        while not finished:
            # dl links to each gov
            table = soup.find('table', 'table table-striped table-striped')
            links = table.findAll('a')
            for link in links:
                info_array = []
                gov = {}
                name = link.get_text(strip=True)
                gov_sauce = requests.get(r'https://www.nga.org' + link.get('href')).text
                gov_soup = bs.BeautifulSoup(gov_sauce, "html.parser")
                #print(gov_soup)
                office_and_stuff_info = gov_soup.findAll('address')
                for address in office_and_stuff_info:
                    infos = address.findAll('p')
                    for info in infos:
                        tex = re.sub(r'\s+', ' ', info.text).strip()  # collapse runs of whitespace
                        if tex: 
                            info_array.append(tex)
                info_array = list(set(info_array))
                gov['Name'] = name
                secondary_address = ''
                gov['Address'] = ''
                for line in info_array:
                    if 'OfficeDates:' in line:
                        gov['OfficeDates'] = line.replace('OfficeDates:', '').replace('-', '')
                    elif 'Succ' in line or 'Fail' in line:
                        gov['Success'] = line
                    elif 'Address' in line:
                        gov['Address'] = line.replace('Address:', '')
                    elif 'Phone:' in line or 'Phone ' in line:
                        gov['Phone'] = line.replace('Phone ', '').replace('Phone: ', '')
                    elif 'Fax:' in line:
                        gov['Fax'] = line.replace('Fax:', '')
                    elif 'Born:' in line:
                        gov['Born'] = line.replace('Born:', '')
                    elif 'Birth State:' in line:
                        gov['BirthState'] = line.replace('Birth State:', '')
                    elif 'Party:' in line:
                        gov['Party'] = line.replace('Party:', '')
                    elif 'School(s)' in line:
                        gov['Schooling'] = line.replace('School(s):', '').replace('School(s) ', '')
                    elif 'Email:' in line:
                        gov['Email'] = line.replace('Email:', '')
                    else:
                        secondary_address = line
                gov['Address'] = gov['Address'] + secondary_address
                # .get() avoids a KeyError when a field never appeared on the page
                data_line = ','.join(gov.get(k, '') for k in
                                     ('Name', 'Address', 'OfficeDates', 'Success', 'Address',
                                      'Phone', 'Fax', 'Born', 'BirthState', 'Party',
                                      'Schooling', 'Email'))
                csv_data.write(data_line + '\n')
            next_page_link = soup.find('ul', 'pagination center-blockdefault').find('a', {'aria-label': 'Next'})
            # bs4 returns the class attribute as a list, so test membership
            if 'disabled' in (next_page_link.parent.get('class') or []):
                finished = True
            else:

                url = r'https://nga.org'+next_page_link.get('href')
                sauce = requests.get(url).text
                soup = bs.BeautifulSoup(sauce,'html.parser')
    except Exception as e:
        print('Code failed:', e)  # surface the actual exception for debugging
    finally:
        csv_data.close()
if __name__ == '__main__':
    main()
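
A side note on the CSV handling: joining fields with bare commas breaks as soon as a value itself contains a comma, which addresses often do. A small sketch of the same output step using csv.DictWriter, which quotes values and fills missing keys with empty strings (field names mirror the header row above, with the duplicated Address column collapsed):

import csv

FIELDS = ['Name', 'Address', 'OfficeDates', 'Success', 'Phone',
          'Fax', 'Born', 'BirthState', 'Party', 'Schooling', 'Email']

def write_gov(writer, gov):
    # missing fields default to '' instead of raising KeyError;
    # DictWriter quotes any value that contains a comma
    writer.writerow({k: gov.get(k, '') for k in FIELDS})

with open('Govs.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, restval='')
    writer.writeheader()
    write_gov(writer, {'Name': 'Sample Governor', 'Party': 'Unknown'})  # example row

Inside the scraping loop, each parsed gov dict would become one call to write_gov.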

What do you mean by "the url links don't follow a loop structure"? You are already extracting the href urls; you just need to iterate over those urls and use BeautifulSoup to pull the structured data you want out of each page.

Try this url, it will let you get all the data at once; I just removed the paging part from it. Give it a try: https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0&endcac77e09-db17-41cb-9de0-687b843338d0=319&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&college=&lastName=&submit=Search&inOffice=Any&sex=Any&militaryService=&biography=&warsServed=&higherOfficesServed=&honors=&religion=&firstName=&party=&

Thanks, this is on the right track! It works in that it prints the content of the "address" tags. But when I try to store the output I get errors. This code does store it, though poorly (duplicated and ugly):

with open('governors_info.csv', 'w') as csvfile:
    for link in links:
        r = urllib.request.urlopen(link).read()
        soup = bs.BeautifulSoup(r, 'html.parser')
        csvfile.write(soup.find('h2').text)  # name of governor
        for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
            csvfile.write(p.text.strip())  # office dates, address, ...
        for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
            csvfile.write(p.text.strip())  # family, school, etc.

It is ugly because of the site's formatting; if you inspect the tags you can see there is a lot of whitespace inside them. Making it pretty is another question, and I think that one has answers already.

Thanks for the effort. After running it for 30+ minutes I got "ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))". I guess the nga.gov site is too slow for this approach? Or it goes too fast and the server closes the connection to "defend" itself. I tried to learn and use requests sessions, but after running for 2.5 hours I got "links = table.findAll('a') AttributeError: 'NoneType' object has no attribute 'findAll'", which makes no sense to me. :)

I updated my code to write out each governor as it is scraped instead of waiting for everything, so even if it crashes you will still have some data. Try the session fix you found and see how far it gets.

But it only writes the header line to the csv file and then crashes before writing any of the info from the URLs. I don't know where the problem is, since it is coded to print "Code failed." when something doesn't work.
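
On the RemoteDisconnected / ConnectionError discussion above: a common mitigation is to reuse one requests.Session, let urllib3 retry transient failures with backoff, and pause briefly between pages so the server is not hammered. A sketch, not tested against nga.org, with the delay value only a guess:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry up to 5 times with exponential backoff on transient server errors
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retry))

def fetch(url):
    resp = session.get(url, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    time.sleep(1)            # polite delay between requests
    return resp.text

Each requests.get call in the answer above could then go through fetch instead. The "'NoneType' object has no attribute 'findAll'" error means soup.find(...) returned None, i.e. the fetched page had no such table (for instance an error page), which raise_for_status() helps catch earlier.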