Problem with a Python web crawler

python html

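I'm new to Python and I'm trying to write a crawler. My problem is that I can't get the href links to print to the console. Any help would be appreciated; my code is below.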

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.rent.ie/houses-to-let/renting_dublin/page_'+ str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        special_divs = soup.findAll('div', {'class':'search_result_title_box'}) 
        for link in special_divs:
            gold = link.findAll('a')
            for link in gold:
                href = gold.get(link['href'])
                print(href)
        page += 1

trade_spider(3)

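I'm not sure where you found the search_result_title_box class. I would locate the links inside the elements that have the search_result class instead. The following code works for me: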

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    """Print the href of each listing link found on the rent.ie search result pages."""

    with requests.Session() as session:
        session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}

        for page in range(1, max_pages + 1):  # include max_pages itself, like the question's while loop
            url = 'http://www.rent.ie/houses-to-let/renting_dublin/page_{page}'.format(page=page)
            response = session.get(url)

            soup = BeautifulSoup(response.content, "html.parser")
            for link in soup.select(".search_result h2 > a[href]"):
                print(link["href"])

if __name__ == '__main__':
    trade_spider(3)
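
The session setup above can be checked without touching the network. The following is a minimal sketch, not part of the original answer: build_page_urls and prepare_requests are hypothetical helper names, the User-Agent value is a shortened placeholder, and max_pages is assumed to be inclusive. It prepares the requests without sending them, so the merged headers can be inspected.

```python
import requests

# Placeholder value (an assumption for this sketch); use a real browser string in practice.
USER_AGENT = "Mozilla/5.0 (example browser string)"


def build_page_urls(max_pages):
    """Return the result-page URLs for pages 1..max_pages (inclusive)."""
    base = "http://www.rent.ie/houses-to-let/renting_dublin/page_{page}"
    return [base.format(page=page) for page in range(1, max_pages + 1)]


def prepare_requests(max_pages):
    """Prepare, but do not send, one GET request per result page."""
    prepared = []
    with requests.Session() as session:
        # Headers set on the session are merged into every request it prepares.
        session.headers.update({"User-Agent": USER_AGENT})
        for url in build_page_urls(max_pages):
            request = requests.Request("GET", url)
            prepared.append(session.prepare_request(request))
    return prepared


if __name__ == "__main__":
    for prepared in prepare_requests(3):
        print(prepared.url, prepared.headers["User-Agent"])
```

Every prepared request carries the session's User-Agent, which is the behaviour the spider relies on when it calls session.get.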

Note the following improvements:

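We use requests.Session to improve the performance of the underlying TCP connection (it is reused across requests) and to configure common settings such as HTTP headers in one place.
We send the User-Agent string of a real browser.
We pick out the links with a CSS selector: .search_result h2 > a[href] matches the desired links in the search result titles, and > denotes a direct parent-child relationship.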
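
To see what the > combinator changes, here is a small sketch that runs the selector against inline HTML instead of the live site. The markup in SAMPLE_HTML and the select_hrefs helper are made up for illustration and are only assumed to resemble the real page structure.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; the real rent.ie page may differ.
SAMPLE_HTML = """
<div class="search_result">
  <h2><a href="/listing/1">Direct child link</a></h2>
  <h2><span><a href="/listing/2">Nested link</a></span></h2>
  <h2><a>Anchor without href</a></h2>
  <p><a href="/about">Link outside any h2</a></p>
</div>
"""


def select_hrefs(html, selector):
    """Return the href of every element matched by a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [link["href"] for link in soup.select(selector)]


if __name__ == "__main__":
    # Child combinator: only <a href> elements that are direct children of an <h2>.
    print(select_hrefs(SAMPLE_HTML, ".search_result h2 > a[href]"))
    # Descendant combinator: any <a href> nested anywhere inside an <h2>.
    print(select_hrefs(SAMPLE_HTML, ".search_result h2 a[href]"))
```

The first selector returns only /listing/1, while the looser descendant form also picks up /listing/2. The [href] part skips anchors that have no href attribute, so link["href"] cannot raise a KeyError.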