Python - Scraping multiple classes on one page with BeautifulSoup

python web-crawler


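I am trying to scrape daily hotel prices from Agoda for multiple room types, along with extra information such as promotions, breakfast conditions, and the "book now, pay later" rule.

My code is as follows: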

import requests
import math
import time  # needed for the time.sleep() call at the end of the loop
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
n = len(soup.select('.room-name'))

for i in range(0, n):
    en_room = soup.select('.room-name')[i].text.strip()
    currency = soup.select('.currency')[i].text
    price = soup.select('.sellprice')[i].text

    try:
        sp_info = soup.select('.left-room-info')[i].text.strip()
    except Exception as e:
        sp_info = "N/A"

    try:
        pay_later = soup.select('.book-now-paylater')[i].text.strip()
    except Exception as e:
        pay_later = "N/A"


    print(en_room, i+1, currency, price, en_room, sp_info, pay_later)
    time.sleep(1)

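I have two questions:

(1) The '.left-room-info' class seems to contain two subclasses, "breakfast" and "room promotion". These subclasses only appear when a given room type actually offers them.

When only one subclass is present, the output looks fine. However, when neither subclass is present, the output is empty where I would like it to show "N/A". Also, when both subclasses are present, the output contains unwanted blank lines that .strip() does not remove.

Is there a way to solve these problems?

(2) When I try to extract information from the '.book-now-paylater' class, the extracted data does not line up with the room types. For example, suppose there are 10 room types and only rooms 2, 4, 6 and 8 allow travelers to book now and pay later. The code extracts 4 "book now, pay later" entries, but those 4 entries are then wrongly assigned to room types 1, 2, 3 and 4.

Is there a way to solve this problem?

Thank you for your help.

Gary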

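Your code does not traverse the DOM properly, and that causes problems when scraping (your second question, for example). Here is a suggestive snippet (not an exact solution); hopefully it lets you work out the first problem yourself.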

# select every room type by its table row (tr) tag
room_types = soup.find_all('tr', class_="room-type")

# iterate over the rows to scrape data from each td or div inside the tr
for room in room_types:
    en_room = room.find('div', class_='room-name').text.strip()
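To make the difference between page-wide and row-scoped selection concrete, here is a minimal sketch. The markup in SAMPLE_HTML is invented for illustration (it only reuses the class names mentioned in the question) and is not the real Agoda page, so treat the structure as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three rooms, only the second one offers "pay later".
SAMPLE_HTML = """
<table>
  <tr class="room-type"><td><div class="room-name">Standard</div></td></tr>
  <tr class="room-type"><td><div class="room-name">Deluxe</div>
      <div class="book-now-paylater">Book now, pay later</div></td></tr>
  <tr class="room-type"><td><div class="room-name">Suite</div></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide selection: a flat list with no link back to the owning room,
# so indexing it with the room counter pairs entry 0 with room 0.
page_wide = [tag.text.strip() for tag in soup.select(".book-now-paylater")]

# Row-scoped selection: each lookup only searches inside the current row.
per_room = []
for room in soup.find_all("tr", class_="room-type"):
    name = room.find("div", class_="room-name").text.strip()
    badge = room.find("div", class_="book-now-paylater")
    per_room.append((name, badge.text.strip() if badge else "N/A"))

print(page_wide)
print(per_room)
```

Because every lookup starts from the row rather than from the whole document, a missing element shows up as None for that room only instead of shifting the values of the rooms after it.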

(1) This happens because, even when the '.left-room-info' selection contains no text, no exception is raised, so your except block never runs. You should instead check whether the value is an empty string (''). A simple "if not" test on the string does this:

sp_info = soup.select('.left-room-info')[i].text.strip()
if not sp_info:
    sp_info = "N/A"

When both subclasses are present, split the string on the carriage return ('\r') and then strip each resulting piece. Note that sp_info is now a list rather than a single string. Putting these pieces together, we get something like this:

sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [ info.strip() for info in sp_info ]
elif not sp_info[0]: # check for empty string
    sp_info = ["N/A"] # keep sp_info a list for consistency
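As an alternative to splitting on '\r', which depends on the exact line endings the page happens to use, the text fragments can be collected with BeautifulSoup's stripped_strings generator. The sketch below is not the approach from the answer above; the helper name clean_room_info and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def clean_room_info(tag, default="N/A"):
    """Join the non-empty text fragments inside ``tag`` with ', '.

    Hypothetical helper: returns ``default`` when the tag is missing
    or contains only whitespace.
    """
    if tag is None:
        return default
    # stripped_strings yields each text node stripped, skipping blank ones
    parts = list(tag.stripped_strings)
    return ", ".join(parts) if parts else default


# Invented markup covering the three cases from the question:
# no subclass, one subclass, and both subclasses with blank lines between.
SAMPLE_HTML = """
<div class="left-room-info"></div>
<div class="left-room-info"><span>Breakfast included</span></div>
<div class="left-room-info">
  <span>Breakfast included</span>

  <span>Early bird promotion</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [clean_room_info(tag) for tag in soup.select(".left-room-info")]
print(results)
```

The result is always a plain string, so the empty case, the single-item case, and the two-item case can all be printed the same way.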

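(2) is a bit more complicated. You have to change the way you parse the page; namely, you will probably have to select on '.room-type'. The way you currently select the "book now, pay later" elements does not associate them with any other element: it just selects the 8 instances of that class on the page. Here is how I would go about it: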
import requests
import math
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

rooms = soup.select('.room-type')[1:] # the first instance of the class isn't a room

room_list = []

for room in rooms:
    room_info = {}

    room_info['en_room'] = room.select('.room-name')[0].text.strip()
    room_info['currency'] = room.select('.currency')[0].text.strip()
    room_info['price'] = room.select('.sellprice')[0].text.strip()

    # split on carriage returns, strip each piece, and drop the empty ones
    sp_parts = room.select('.left-room-info')[0].text.split('\r')
    sp_parts = [ info.strip() for info in sp_parts if info.strip() ]
    # always store a string: joined pieces, or "N/A" when nothing is left
    room_info['sp_info'] = ", ".join(sp_parts) if sp_parts else "N/A"

    pay_later = room.select('.book-now-paylater')
    room_info['pay_later'] = pay_later[0].text.strip() if pay_later else "N/A"

    room_list.append(room_info)
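The loop above indexes [0] on several selections, which raises IndexError if a row lacks one of those elements. One way to guard against that is a small helper built on select_one, which returns None when nothing matches. This is only a sketch: text_or_default is a hypothetical name, and the sample rows are invented rather than taken from the real page.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, selector, default="N/A"):
    """Return the stripped text of the first match under ``parent``.

    Hypothetical helper: falls back to ``default`` when the selector
    matches nothing or the matched element has no text.
    """
    match = parent.select_one(selector)
    if match is None:
        return default
    text = match.get_text(strip=True)
    return text if text else default


# Invented markup: the second row has an empty price, and neither row
# has a "pay later" element.
SAMPLE_HTML = """
<table>
  <tr class="room-type"><td class="room-name"> Deluxe </td><td class="sellprice">3,200</td></tr>
  <tr class="room-type"><td class="room-name">Suite</td><td class="sellprice"></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
summary = [
    (
        text_or_default(row, ".room-name"),
        text_or_default(row, ".sellprice"),
        text_or_default(row, ".book-now-paylater"),
    )
    for row in soup.select(".room-type")
]
print(summary)
```

With a helper like this, each field of room_info could be filled by a single call, and a missing element yields "N/A" instead of stopping the loop.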

Did you try mapping what bs4 returns against what you see in Inspect Element?

Thank you so much! You saved my day!

Thanks for your guidance. It helped a lot!