Python: BeautifulSoup does not return all of the text on a web page

python web-scraping

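I am trying to scrape a website, but BeautifulSoup does not return all of the text that is visible when I view the page in a browser. See the code below: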

import requests
from bs4 import BeautifulSoup

f = open("data.txt", "w")
url = "https://www.hiltongrandvacations.com/en/resorts-and-destinations"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')
f.write(str(soup))
f.close()

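For example, the following text is visible on the web page but is not returned by BeautifulSoup (and so is not written to the text file): Grand Pacific Palisades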

I tried different parsers (html, lxml) but still do not get it. Also, the text does not appear to be generated by JavaScript, though I may be wrong.

You can try:

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

This will return everything on the web page.
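Whether a different parser can help is something that can be checked directly: if the text is missing from the raw response body, no parser will recover it. The following is a minimal sketch using only the standard library. The helper name `text_in_static_html` and the class `_VisibleText` are hypothetical, introduced here for illustration; the sample HTML is made up and is not the real page.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_static_html(html, needle):
    """Return True if `needle` occurs in the text of the downloaded HTML.

    Hypothetical helper: a False result suggests the text is injected later
    by JavaScript (or fetched from an API) and is not in the initial response.
    """
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    visible = " ".join(" ".join(parser.chunks).split())
    return needle.lower() in visible.lower()


# Made-up sample: the resort name only appears inside a script, not as page text.
sample = (
    "<html><body><h1>Resorts</h1>"
    "<script>var name = 'Grand Pacific Palisades';</script>"
    "</body></html>"
)
print(text_in_static_html(sample, "Resorts"))
print(text_in_static_html(sample, "Grand Pacific Palisades"))
```

Against the real site, this could be called as `text_in_static_html(requests.get(url).text, "Grand Pacific Palisades")`.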

The data you see is loaded dynamically via JavaScript. You can use this example to load the data:

import json
import requests


payload = {"locations":[],"amenities":[],"vacationTypes":[],"page":1,"pageSize":9}
api_url = 'https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch'

data = requests.put(api_url, json=payload).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some info on screen:
for card in data['Cards']:
    print(card['Title'])
    print(card['Description'])
    print('-' * 80)

Prints:

Sunrise Lodge, a Hilton Grand Vacations Club
Revel in the peak of adventure
--------------------------------------------------------------------------------
The District by Hilton Club
A capital experience in the capital city
--------------------------------------------------------------------------------
The Central at 5th by Hilton Club
At the heart of city life
--------------------------------------------------------------------------------
The Hilton Club – New York
Make a break for the Big Apple.
--------------------------------------------------------------------------------
The Residences by Hilton Club
Wake up in the city that never sleeps.
--------------------------------------------------------------------------------
Grand Pacific Palisades Vacation Resort
A window to the Pacific Ocean. 
--------------------------------------------------------------------------------
Carlsbad Seapointe Resort
A quintessentially Californian vacation
--------------------------------------------------------------------------------
Hilton Grand Vacations Chicago Downtown/Magnificent Mile
A sky-high sanctuary amidst the big-city bustle
--------------------------------------------------------------------------------
Hilton Grand Vacations Club at Trump International Hotel Las Vegas

--------------------------------------------------------------------------------
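The payload above requests `"page": 1` with `"pageSize": 9`, which matches the nine cards printed. To collect more than the first page, one option is to keep incrementing `page` until no cards come back. This is a sketch under assumptions: that the endpoint honors the `page` field and returns an empty `Cards` list past the last page has not been verified here, and `iter_cards`, `fetch_page_live`, and the `max_pages` safety cap are hypothetical names introduced for illustration.

```python
def iter_cards(fetch_page, page_size=9, max_pages=50):
    """Yield cards page by page until a page comes back with no 'Cards'.

    `fetch_page` is any callable that takes the payload dict and returns the
    decoded JSON response. `max_pages` is a safety cap in case the API never
    returns an empty page (assumption: it normally does).
    """
    for page in range(1, max_pages + 1):
        payload = {
            "locations": [],
            "amenities": [],
            "vacationTypes": [],
            "page": page,
            "pageSize": page_size,
        }
        cards = fetch_page(payload).get("Cards") or []
        if not cards:
            break
        yield from cards


def fetch_page_live(payload):
    """Send one request to the API used in the answer above."""
    import requests  # imported lazily so iter_cards can be tried offline

    api_url = "https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch"
    response = requests.put(api_url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()
```

Usage would then be `for card in iter_cards(fetch_page_live): print(card['Title'])`. Passing the fetch function in as a parameter keeps the paging logic separate from the network call, so it can be exercised with a stub.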

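Here is an example that uses Selenium to parse this web page. It lets you simulate user behavior: wait for the page to load, scroll down to the Locations section, activate the Locations drop-down button, select a location (Utah in this example), click it, wait for the new page to load, and extract some information from it.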

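from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
#chrome_options.add_argument('--no-sandbox')
# Selenium 4 style: the driver path is passed through Service and the options as `options=`
# (Selenium 3 code used webdriver.Chrome('<PATH_TO_CHROME_DRIVER>', chrome_options=chrome_options))
wd = webdriver.Chrome(service=Service('<PATH_TO_CHROME_DRIVER>'), options=chrome_options)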

# delay (how long selenium waits for element to be loaded)
DELAY = 30

# maximize browser window
wd.maximize_window()

# load page via selenium
wd.get("https://www.hiltongrandvacations.com/en/resorts-and-destinations")

# wait until the results table has loaded
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Results")]')))

# find locations button, scroll down to it, click it
locations_button = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[contains(text(), "Locations")]')))
wd.execute_script("arguments[0].scrollIntoView();", locations_button)
wd.execute_script("arguments[0].click();", locations_button)

# find utah checkbox, scroll down to it, click it
utah_checkbox = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Utah")]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_checkbox)
wd.execute_script("arguments[0].click();", utah_checkbox)

# find link to utah
utah_link = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//a[@title="Sunrise Lodge, a Hilton Grand Vacations Club Park City, Utah, Revel in the peak of adventure"]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_link)
wd.execute_script("arguments[0].click();", utah_link)

# find description
description = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.CLASS_NAME, 'image-and-intro__description')))

print(description.text)

If Selenium alone is not enough, you can also combine it with BeautifulSoup.
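One way to combine the two is to let Selenium render the page and then hand the resulting HTML to BeautifulSoup for parsing. The sketch below assumes the class name `image-and-intro__description` from the Selenium example; `extract_description` is a hypothetical helper name, and the sample HTML in the usage lines is made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_description(page_html):
    """Return the text of the description block, or None if it is absent.

    `page_html` is expected to be fully rendered HTML, for example
    `wd.page_source` taken after the explicit wait in the Selenium example.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    node = soup.find(class_="image-and-intro__description")
    if node is None:
        return None
    # Join nested text nodes with single spaces and trim each piece.
    return node.get_text(" ", strip=True)


# Made-up sample markup, standing in for wd.page_source:
sample = (
    '<div class="image-and-intro__description">'
    "<p>Revel in the <b>peak</b> of adventure</p></div>"
)
print(extract_description(sample))
```

With a live driver the call would be `extract_description(wd.page_source)`. Parsing a snapshot of the rendered page this way avoids repeated round trips to the browser when many elements need to be read.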

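Thanks Andrej, I will give it a try and I appreciate your help. I dug into your solution and, from some comments, worked out that you can find the api_url and the payload by looking at the XHR requests the web page makes. What I would like to do is, once you click a property's link on the main page, scrape that property's details (for example). I cannot seem to get this information from the API currently specified. Any ideas? I appreciate all the help here.

@Franz Try asking another question that describes your current needs, since the question you originally asked has been solved.

@Alexandra Thank you, Alexandra.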