Showing all search results when web scraping with Python


I am trying to get a list of URLs from the European Parliament's Legislative Observatory. I do not enter any search keyword, so that I get the links to all documents (currently 13,172). With the code below I can easily scrape the list of the first 10 results that the site displays. However, I would like to get all of the links, so that I do not have to somehow press the next-page button. Please let me know if you know a way to achieve this.

import requests, bs4

# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'

# function gets a list of links to the procedures
def links_to_procedures(url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')  # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]  # getting a list of links of the procedure titles
    return links

print(links_to_procedures(url_main))

You can follow the pagination by specifying the page GET parameter.
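
For example, requests can build the paginated URL for you if you pass the page number through the params argument (the remaining query parameters below are copied from the result URL used in the full script; a minimal sketch):

import requests

# fetch the second page of results, 50 rows per page
response = requests.get(
    'http://www.europarl.europa.eu/oeil/search/result.do',
    params={'page': 2, 'rows': 50, 'sort': 'd', 'searchTab': 'y', 'sortTab': 'y'},
)
print(response.url)  # the final URL with the page parameter appended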

First, get the result count, then calculate the number of pages to process by dividing the count by the number of results per page, rounding up so that the last, partially filled page is not skipped. Then iterate over the pages one by one and collect the links:

import math
import re

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content, 'html.parser')

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print("Results found: " + str(num_results))

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

# round up, otherwise the last (partially filled) page would be skipped
num_pages = math.ceil(num_results / results_per_page)

links = []
for page in range(1, num_pages + 1):
    print("Current page: " + str(page))

    url = base_url.format(page=page)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print(links)
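
Note that the collected hrefs may be relative paths rather than full URLs. If you need absolute URLs, here is a small sketch using urllib.parse.urljoin (assuming links holds the hrefs gathered above):

from urllib.parse import urljoin

# resolve relative procedure links against the site root
absolute_links = [urljoin('http://www.europarl.europa.eu', href) for href in links]
print(absolute_links[:5])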