Python: writing a loop over multiple pages with BeautifulSoup

I am trying to scrape several pages of results from a county search tool, but I can't figure out how to iterate beyond the first page.

import csv
from mechanize import Browser
from bs4 import BeautifulSoup

url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main'

br = Browser()
br.set_handle_robots(False)
br.open(url)

br.select_form("county_search_form")

br.form['county_select'] = ['111111111111180']
br.form['start_date_month'] = ['1']
br.form['start_date_day'] = ['1']
br.form['start_date_year'] = ['2014']

br.submit()

soup = BeautifulSoup(br.response())

complaints = soup.find('table', class_='waciList')

output = []

import requests
for i in xrange(1,8):
    page = requests.get("http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.search&pageNumber={}".format(i))
    if not page.ok:
        continue
    soup = BeautifulSoup(requests.text)

    for tr in complaints.findAll('tr'):
        print tr
        output_row = []
        for td in tr.findAll('td'):
            output_row.append(td.text.strip())

        output.append(output_row)

br.open(url)
print 'page 2'
complaints = soup.find('table', class_='waciList')

for tr in complaints.findAll('tr'):
    print tr

with open('out-tceq.csv', 'w') as csvfile:
    my_writer = csv.writer(csvfile, delimiter='|')
    my_writer.writerows(output)

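I only get the first page of results in the output CSV. After looking at other scraping examples that use bs4, I tried adding the `import requests` loop, but I got the error "ImportError: No module named requests".

Any thoughts on how I should loop through all eight pages of results and write them to the .csv?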

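You don't actually need the requests module to iterate through paged search results; mechanize is more than enough. Here is one possible approach using mechanize.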

First, get all the pagination links from the current page:

links = br.links(url_regex=r"fuseaction=home.search&pageNumber=")
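Pagination links are often rendered more than once on a page (for example above and below the results table), so it may be worth de-duplicating them before following them. The sketch below illustrates the idea with the standard `re` module only. The helper name `unique_page_urls` and the added `(\d+)` capture group are assumptions for illustration and are not part of the answer; whether this site actually repeats its links has not been checked.

```python
import re

# Same pattern the answer passes to br.links(url_regex=...),
# extended with a capture group for the page number (an assumption).
PAGE_LINK = re.compile(r"fuseaction=home\.search&pageNumber=(\d+)")


def unique_page_urls(urls):
    """Keep one URL per page number, preserving first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        match = PAGE_LINK.search(url)
        if match is None:
            continue  # not a pagination link
        page = int(match.group(1))
        if page not in seen:
            seen.add(page)
            kept.append(url)
    return kept
```

With mechanize, the input could be built as `[link.url for link in br.links(url_regex=...)]`, since each `Link` object exposes a `url` attribute.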

Then iterate over the page links, opening each one and gathering the useful information from that page on every iteration:

for link in links:
    #open link url:
    br.follow_link(link)

    #print url of current page, just to make sure we are on the expected page:
    print(br.geturl())

    #create soup from HTML of previously opened link:
    soup = BeautifulSoup(br.response())

    #TODO: gather information from current soup object here
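One way the TODO step could be filled in is sketched below, reusing the `waciList` table class and the `|` delimiter from the question. It is written for Python 3, and the function names (`rows_from_html`, `collect_all_pages`, `write_rows`) are hypothetical. It assumes `br.follow_link()` returns a response object with a `read()` method, and it materializes the links with `list()` before following them, on the assumption that the link iterator is tied to the page that was current when it was created.

```python
import csv
from bs4 import BeautifulSoup


def rows_from_html(html):
    """Return the cell text of every data row in the first table.waciList."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="waciList")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows that hold only <th> cells come out empty; skip them
            rows.append(cells)
    return rows


def collect_all_pages(br, links):
    """Follow each pagination link and gather the rows from every page."""
    output = []
    for link in list(links):  # materialize before navigating away
        response = br.follow_link(link)
        output.extend(rows_from_html(response.read()))
    return output


def write_rows(fileobj, rows):
    """Write the rows as pipe-delimited CSV, as in the question."""
    csv.writer(fileobj, delimiter="|").writerows(rows)
```

Note that the pagination links may not include the page you are already on, so the rows of the first results page may need to be added separately, for example with `rows_from_html(br.response().read())` right after `br.submit()`. In Python 3, open the output file with `newline=''` before passing it to `write_rows`.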

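You need to use it in your code first, right? So I installed the requests module from the terminal and ran the code again. It still only iterates over the first page, but now the output file it creates contains no records.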