Python: scraping a website with multiple links and no "Next" button, using Beautiful Soup

Tags: python, pandas, web-scraping, beautifulsoup

I am very new to Python (three days in) and I've run into a problem I can't solve with Google/YouTube. I want to scrape background data on all US governors and save it to a CSV file.

I have managed to scrape a list of all the governors, but to get more details I need to visit each governor's page individually and save the data from there. The code suggestions I found online loop over multiple pages using a "Next" button or a predictable URL structure. This site, however, has no "Next" button, and the links to the individual pages don't follow a pattern I can loop over, so I'm stuck.

Any help would be much appreciated. I want to extract the information shown above the main text on each governor's page (office dates, schooling, etc., which sit in "address" tags).

Here's what I have so far:

import bs4 as bs
import urllib.request
import pandas as pd

url = 'https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=10&endcac77e09-db17-41cb-9de0-687b843338d0=9999&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&militaryService=&higherOfficesServed=&religion=&lastName=&sex=Any&honors=&submit=Search&college=&firstName=&party=&inOffice=Any&biography=&warsServed=&'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, "html.parser")

#dl list of all govs
dfs = pd.read_html(url, header=0)
pd.concat(dfs).to_csv('governors.csv')  # combine all result tables into one csv

#dl links to each gov
table = soup.find('table', 'table table-striped table-striped')
links = table.findAll('a')
with open('governors_links.csv', 'w') as r:
    for link in links:
        r.write(link['href'])
        r.write('\n')

#enter each gov page and extract data in the "address" tag(s)
#save this in a csv file

I'm assuming you have the links stored in a list called links.

You can then fetch the data for all governors one by one like this:

for link in links:
    r = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(r, 'html.parser')
    print(soup.find('h2').text)  # Name of Governor
    for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
        print(p.text.strip())  # Office dates, address, phone, ...
    for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
        print(p.text.strip())  # Family, school, birth state, ...
EDIT:

Change your links list to

links = ['https://www.nga.org' + x.get('href') for x in table.findAll('a')]
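
If it helps, here is a minimal sketch of how the two snippets above could be combined to store the scraped fields instead of printing them. It is untested and assumes the same col-md-3 / col-md-7 markup and the links list built above; the csv module quotes fields for you, so commas inside addresses will not break the file:

import csv
import bs4 as bs
import urllib.request

# `links` is assumed to be the list of absolute URLs built above
with open('governors_details.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for link in links:
        page = urllib.request.urlopen(link).read()
        soup = bs.BeautifulSoup(page, 'html.parser')
        row = [soup.find('h2').text.strip()]  # governor's name
        for div_class in ('col-md-3', 'col-md-7'):
            for p in soup.find('div', {'class': div_class}).findAll('p'):
                row.append(p.text.strip())  # office dates, address, family, school, ...
        writer.writerow(row)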

This might work. I haven't tested it fully to completion because I'm at work, but it should be a starting point for you:

import bs4 as bs
import requests
import re
# helper: returns True if the string parses as an int
def is_number(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def main():
    url = 'https://www.nga.org/cms/FormerGovBios?inOffice=Any&state=Any&party=&lastName=&firstName=&nbrterms=Any&biography=&sex=Any&religion=&race=Any&college=&higherOfficesServed=&militaryService=&warsServed=&honors=&birthState=Any&submit=Search'

    sauce = requests.get(url).text
    soup = bs.BeautifulSoup(sauce, "html.parser")
    finished = False
    csv_data = open('Govs.csv', 'a')
    csv_data.write('Name,Address,OfficeDates,Success,Address,Phone,Fax,Born,BirthState,Party,Schooling,Email\n')
    try:
        while not finished:
            # dl links to each gov
            table = soup.find('table', 'table table-striped table-striped')
            links = table.findAll('a')
            for link in links:
                info_array = []
                gov = {}
                name = link.get_text(strip=True)
                gov_sauce = requests.get(r'https://www.nga.org' + link.get('href')).text
                gov_soup = bs.BeautifulSoup(gov_sauce, "html.parser")
                #print(gov_soup)
                office_and_stuff_info = gov_soup.findAll('address')
                for address in office_and_stuff_info:
                    infos = address.findAll('p')
                    for info in infos:
                        tex = re.sub(r'\s+', ' ', info.text).strip()  # collapse runs of whitespace
                        if tex: 
                            info_array.append(tex)
                info_array = list(set(info_array))
                gov['Name'] = name
                secondary_address = ''
                gov['Address'] = ''
                for line in info_array:
                    if 'OfficeDates:' in line:
                        gov['OfficeDates'] = line.replace('OfficeDates:', '').replace('-', '')
                    elif 'Succ' in line or 'Fail' in line:
                        gov['Success'] = line
                    elif 'Address' in line:
                        gov['Address'] = line.replace('Address:', '')
                    elif 'Phone:' in line or 'Phone ' in line:
                        gov['Phone'] = line.replace('Phone ', '').replace('Phone: ', '')
                    elif 'Fax:' in line:
                        gov['Fax'] = line.replace('Fax:', '')
                    elif 'Born:' in line:
                        gov['Born'] = line.replace('Born:', '')
                    elif 'Birth State:' in line:
                        gov['BirthState'] = line.replace('Birth State:', '')
                    elif 'Party:' in line:
                        gov['Party'] = line.replace('Party:', '')
                    elif 'School(s)' in line:
                        gov['Schooling'] = line.replace('School(s):', '').replace('School(s) ', '')
                    elif 'Email:' in line:
                        gov['Email'] = line.replace('Email:', '')
                    else:
                        secondary_address = line
                gov['Address'] = gov['Address'] + secondary_address
                # .get() avoids a KeyError when a field never appeared on the page
                data_line = ','.join(gov.get(k, '') for k in
                                     ('Name', 'Address', 'OfficeDates', 'Success', 'Address',
                                      'Phone', 'Fax', 'Born', 'BirthState', 'Party',
                                      'Schooling', 'Email'))
                csv_data.write(data_line + '\n')
            next_page_link = soup.find('ul', 'pagination center-blockdefault').find('a', {'aria-label': 'Next'})
            # bs4 returns the class attribute as a list, so test membership
            if 'disabled' in (next_page_link.parent.get('class') or []):
                finished = True
            else:

                url = r'https://nga.org'+next_page_link.get('href')
                sauce = requests.get(url).text
                soup = bs.BeautifulSoup(sauce,'html.parser')
    except Exception as e:
        print('Code failed:', e)  # surface the actual exception for debugging
    finally:
        csv_data.close()
if __name__ == '__main__':
    main()
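
A side note on the CSV handling: joining fields with bare commas breaks as soon as a value itself contains a comma, which addresses often do. A small sketch of the same output step using csv.DictWriter, which quotes values and fills missing keys with empty strings (field names mirror the header row above, with the duplicated Address column collapsed):

import csv

FIELDS = ['Name', 'Address', 'OfficeDates', 'Success', 'Phone',
          'Fax', 'Born', 'BirthState', 'Party', 'Schooling', 'Email']

def write_gov(writer, gov):
    # missing fields default to '' instead of raising KeyError;
    # DictWriter quotes any value that contains a comma
    writer.writerow({k: gov.get(k, '') for k in FIELDS})

with open('Govs.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, restval='')
    writer.writeheader()
    write_gov(writer, {'Name': 'Sample Governor', 'Party': 'Unknown'})  # example row

Inside the scraping loop, each parsed gov dict would become one call to write_gov.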

What do you mean by "the url links don't follow a loop structure"? You are already extracting the href urls; you just need to iterate over those urls and use BeautifulSoup to pull the structured data you want out of each page.

Try this url, it will let you get all the data at once; I just removed the paging part from it. Give it a try: https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0&endcac77e09-db17-41cb-9de0-687b843338d0=319&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&college=&lastName=&submit=Search&inOffice=Any&sex=Any&militaryService=&biography=&warsServed=&higherOfficesServed=&honors=&religion=&firstName=&party=&

Thanks, this is on the right track! It works in that it prints the content of the "address" tags. But when I try to store the output I get errors. This code does store it, though poorly (duplicated and ugly):

with open('governors_info.csv', 'w') as csvfile:
    for link in links:
        r = urllib.request.urlopen(link).read()
        soup = bs.BeautifulSoup(r, 'html.parser')
        csvfile.write(soup.find('h2').text)  # name of governor
        for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
            csvfile.write(p.text.strip())  # office dates, address, ...
        for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
            csvfile.write(p.text.strip())  # family, school, etc.

It is ugly because of the site's formatting; if you inspect the tags you can see there is a lot of whitespace inside them. Making it pretty is another question, and I think that one has answers already.

Thanks for the effort. After running it for 30+ minutes I got "ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))". I guess the nga.gov site is too slow for this approach? Or it goes too fast and the server closes the connection to "defend" itself. I tried to learn and use requests sessions, but after running for 2.5 hours I got "links = table.findAll('a') AttributeError: 'NoneType' object has no attribute 'findAll'", which makes no sense to me. :)

I updated my code to write out each governor as it is scraped instead of waiting for everything, so even if it crashes you will still have some data. Try the session fix you found and see how far it gets.

But it only writes the header line to the csv file and then crashes before writing any of the info from the URLs. I don't know where the problem is, since it is coded to print "Code failed." when something doesn't work.
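
On the RemoteDisconnected / ConnectionError discussion above: a common mitigation is to reuse one requests.Session, let urllib3 retry transient failures with backoff, and pause briefly between pages so the server is not hammered. A sketch, not tested against nga.org, with the delay value only a guess:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry up to 5 times with exponential backoff on transient server errors
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retry))

def fetch(url):
    resp = session.get(url, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    time.sleep(1)            # polite delay between requests
    return resp.text

Each requests.get call in the answer above could then go through fetch instead. The "'NoneType' object has no attribute 'findAll'" error means soup.find(...) returned None, i.e. the fetched page had no such table (for instance an error page), which raise_for_status() helps catch earlier.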