Python web-scraping script for booking.com doesn't work


I made a script that scrapes the hotel name, the rating and the perks for each hotel on this page:

Here is my script:

import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)


soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]

root_url = 'https://www.booking.com/'
urls1 = ['{root}{i}'.format(root=root_url, i=i) for i in links1]



pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    try :
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)

    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)

    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')



data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})


#print(data.head(20))

data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
It worked: I made a loop to scrape all the hotel links, and then scrape the rating and the perks for each of those hotels. But I was getting duplicates, so instead of:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]

I put:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]

as you can see in the script above.

But now it no longer works: I only get Nan, whereas before, when I still had the duplicates, there were some Nan but most rows had the perks and ratings I wanted. I don't understand why.
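As an aside, the duplicates can also be dropped without narrowing the class filter. A minimal sketch, assuming soup is the BeautifulSoup object from the script above and that the duplicate entries are exact string matches:

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
links1 = list(dict.fromkeys(links1))  # keep the first occurrence of each link, preserving order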

Here is the HTML for the hotel link:

Here is the HTML to get the name (once I have the link, the script goes to that link):

And here is the HTML to get all the perks associated with a hotel (like the name, the script goes to the link scraped before):

Here is my result:


The href values on that site contain newline characters: one at the start and another partway through. So when you try to combine them with your root_url, you do not get a valid URL.

A fix can be to remove all the newlines. Since the href always starts with a /, that slash can also be dropped from the root_url, or you could use
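a URL-joining helper; one option is urllib.parse.urljoin, which copes with the leading slash on its own. A minimal sketch, with a hypothetical href value standing in for what the search page returns:

from urllib.parse import urljoin

raw_href = '\n/hotel/fr/some-hotel.fr.html\n?label=gen173nr'   # hypothetical href containing the two stray newlines
clean_href = raw_href.replace('\n', '')                        # strip the newlines first
full_url = urljoin('https://www.booking.com', clean_href)      # urljoin handles the leading slash
print(full_url)   # https://www.booking.com/hotel/fr/some-hotel.fr.html?label=gen173nr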

This will give you an output CSV file starting:

Notes;Points fort;Nom
8,3 ;['Parking (paid)', 'Free WiFi included', 'Family rooms', 'Airport shuttle', 'Non-smoking rooms', '24-hour front desk', 'Bar'];Elysées Union
8,4 ;['Free WiFi included', 'Family rooms', 'Non-smoking rooms', 'Pets allowed', '24-hour front desk', 'Rooms/facilities for disabled guests'];Hyatt Regency Paris Étoile
8,3 ;['Free WiFi included', 'Family rooms', 'Non-smoking rooms', 'Pets allowed', 'Restaurant', '24-hour front desk', 'Bar'];Pullman Paris Tour Eiffel
8,7 ;['Free WiFi included', 'Non-smoking rooms', 'Restaurant', '24-hour front desk', 'Rooms/facilities for disabled guests', 'Lift', 'Bar'];citizenM Paris Gare de Lyon
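Since the ratings keep their French decimal comma and the file is semicolon-separated, here is a small sketch of reading it back with pandas (assuming the datatest.csv written above):

import pandas as pd

df = pd.read_csv('datatest.csv', sep=';', decimal=',')   # ';' as separator, ',' as the decimal mark for the ratings
print(df.head())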
Thank you so much!! :) That was subtle, well spotted.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

# The fix: strip the stray newlines from each href before building the full URL
links1 = [a['href'].replace('\n','') for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_='js-sr-hotel-link hotel_name_link url', href=True)]
root_url = 'https://www.booking.com'  # no trailing slash, since each href already starts with /
urls1 = [f'{root_url}{i}' for i in links1]

pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    try:
        # Perks: the English names kept in each "important facility" div's data-name-en attribute
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')


data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})

#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
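One small refinement worth considering: the per-hotel requests inside the loop are sent without the browser-like headers used for the search page. A minimal sketch of reusing them through a requests.Session (assuming the same headers dict defined above):

import requests

session = requests.Session()
session.headers.update(headers)    # every request made through the session now carries these headers

results = session.get(url0)        # search results page
# ...and inside the loop:
#     results = session.get(url)   # each hotel page is fetched with the same headers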