Python web-scraping script for booking.com doesn't work


I made a script that scrapes the hotel name, the rating and the perks for each hotel on this page:

Here is my script:

import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)


soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]

root_url = 'https://www.booking.com/'
urls1 = ['{root}{i}'.format(root=root_url, i=i) for i in links1]



pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    try :
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)

    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)

    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')



data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})


#print(data.head(20))

data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
It worked: I made a loop to scrape all the hotel links, and then scrape the rating and the perks for each of those hotels. But I was getting duplicates, so instead of:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]

I put:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]

as you can see in the script above.

But now it no longer works: I only get Nan, whereas before, when I still had the duplicates, there were some Nan but most rows had the perks and ratings I wanted. I don't understand why.
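As an aside, the duplicates can also be dropped without narrowing the class filter. A minimal sketch, assuming soup is the BeautifulSoup object from the script above and that the duplicate entries are exact string matches:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
links1 = list(dict.fromkeys(links1))  # keep the first occurrence of each link, preserving order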

Here is the HTML for the hotel link:

Here is the HTML to get the name (once I have the link, the script goes to that link):

And here is the HTML to get all the perks associated with a hotel (like the name, the script goes to the link scraped before):

Here is my result:


The href values on that site contain newline characters: one at the start and another partway through. So when you try to combine them with your root_url, you do not get a valid URL.

A fix can be to remove all the newlines. Since the href always starts with a /, that slash can also be dropped from the root_url, or you could use
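a URL-joining helper; one option is urllib.parse.urljoin, which copes with the leading slash on its own. A minimal sketch, with a hypothetical href value standing in for what the search page returns:

from urllib.parse import urljoin

raw_href = '\n/hotel/fr/some-hotel.fr.html\n?label=gen173nr'   # hypothetical href containing the two stray newlines
clean_href = raw_href.replace('\n', '')                        # strip the newlines first
full_url = urljoin('https://www.booking.com', clean_href)      # urljoin handles the leading slash
print(full_url)   # https://www.booking.com/hotel/fr/some-hotel.fr.html?label=gen173nr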

This will give you an output CSV file starting:

Notes;Points fort;Nom
8,3 ;['Parking (paid)', 'Free WiFi included', 'Family rooms', 'Airport shuttle', 'Non-smoking rooms', '24-hour front desk', 'Bar'];Elysées Union
8,4 ;['Free WiFi included', 'Family rooms', 'Non-smoking rooms', 'Pets allowed', '24-hour front desk', 'Rooms/facilities for disabled guests'];Hyatt Regency Paris Étoile
8,3 ;['Free WiFi included', 'Family rooms', 'Non-smoking rooms', 'Pets allowed', 'Restaurant', '24-hour front desk', 'Bar'];Pullman Paris Tour Eiffel
8,7 ;['Free WiFi included', 'Non-smoking rooms', 'Restaurant', '24-hour front desk', 'Rooms/facilities for disabled guests', 'Lift', 'Bar'];citizenM Paris Gare de Lyon
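Since the ratings keep their French decimal comma and the file is semicolon-separated, here is a small sketch of reading it back with pandas (assuming the datatest.csv written above):

import pandas as pd

df = pd.read_csv('datatest.csv', sep=';', decimal=',')   # ';' as separator, ',' as the decimal mark for the ratings
print(df.head())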
Thank you so much!! :) That was subtle, well spotted.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

# The fix: strip the stray newlines from each href before building the full URL
links1 = [a['href'].replace('\n','') for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]
root_url = 'https://www.booking.com'  # no trailing slash, since each href already starts with /
urls1 = [f'{root_url}{i}' for i in links1]

pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    try:
        # Perks: the English names kept in each "important facility" div's data-name-en attribute
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')


data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})

#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
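One small refinement worth considering: the per-hotel requests inside the loop are sent without the browser-like headers used for the search page. A minimal sketch of reusing them through a requests.Session (assuming the same headers dict defined above):

import requests

session = requests.Session()
session.headers.update(headers)    # every request made through the session now carries these headers

results = session.get(url0)        # search results page
# ...and inside the loop:
#     results = session.get(url)   # each hotel page is fetched with the same headers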