Python BeautifulSoup does not return all the text on a web page

I'm trying to scrape a website, but BeautifulSoup does not return all of the text that is visible on the page. See the code below:

import requests
from bs4 import BeautifulSoup

url = "https://www.hiltongrandvacations.com/en/resorts-and-destinations"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')

# write the parsed HTML to a file; the context manager closes it for us
with open("data.txt", "w", encoding="utf-8") as f:
    f.write(str(soup))
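
For example, the following text is visible on the web page, but BeautifulSoup does not return it (it is not written to the text file): Grand Pacific Palisades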

I tried different parsers (html, lxml) but still don't get it. Also, it seems the text is not generated by JavaScript, though I could be wrong.

You can try:

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

This will return everything on the web page.
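
Before switching parsers, it can help to check whether the text is in the HTML the server sends at all. The sketch below is only an illustration: `text_in_source` is a hypothetical helper, not part of requests or BeautifulSoup, and it simply does a case-insensitive search after decoding HTML entities and collapsing whitespace. If it returns False for text you can see in the browser, the text is most likely added later by JavaScript, and no parser will find it.

```python
import html
import re


def text_in_source(page_html, needle):
    """Return True if needle occurs in the raw HTML, ignoring case,
    HTML entities and runs of whitespace. (Hypothetical helper.)"""
    def normalize(s):
        # decode entities such as &amp;, then collapse whitespace and lowercase
        return re.sub(r"\s+", " ", html.unescape(s)).strip().lower()

    return normalize(needle) in normalize(page_html)


# Usage with the question's code (requires network access):
# response = requests.get(url)
# print(text_in_source(response.text, "Grand Pacific Palisades"))
```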

The data you see is loaded dynamically via JavaScript. You can load it with this example:

import json
import requests


payload = {"locations":[],"amenities":[],"vacationTypes":[],"page":1,"pageSize":9}
api_url = 'https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch'

data = requests.put(api_url, json=payload).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some info on screen:
for card in data['Cards']:
    print(card['Title'])
    print(card['Description'])
    print('-' * 80)

Prints:

Sunrise Lodge, a Hilton Grand Vacations Club
Revel in the peak of adventure
--------------------------------------------------------------------------------
The District by Hilton Club
A capital experience in the capital city
--------------------------------------------------------------------------------
The Central at 5th by Hilton Club
At the heart of city life
--------------------------------------------------------------------------------
The Hilton Club – New York
Make a break for the Big Apple.
--------------------------------------------------------------------------------
The Residences by Hilton Club
Wake up in the city that never sleeps.
--------------------------------------------------------------------------------
Grand Pacific Palisades Vacation Resort
A window to the Pacific Ocean. 
--------------------------------------------------------------------------------
Carlsbad Seapointe Resort
A quintessentially Californian vacation
--------------------------------------------------------------------------------
Hilton Grand Vacations Chicago Downtown/Magnificent Mile
A sky-high sanctuary amidst the big-city bustle
--------------------------------------------------------------------------------
Hilton Grand Vacations Club at Trump International Hotel Las Vegas

--------------------------------------------------------------------------------
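
The payload above asks for page 1 with 9 results, and the output shows exactly 9 cards, so there are probably more pages. A minimal sketch of paging through the results follows. It assumes that a page with fewer than `pageSize` cards is the last one, which the thread does not confirm. `build_payload` and `iter_cards` are hypothetical helper names, and the HTTP call is passed in as a function so the loop can be tried without touching the site.

```python
API_URL = "https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch"


def build_payload(page, page_size=9):
    """Same payload as in the answer above, with the page number filled in."""
    return {
        "locations": [],
        "amenities": [],
        "vacationTypes": [],
        "page": page,
        "pageSize": page_size,
    }


def iter_cards(fetch_page, page_size=9, max_pages=50):
    """Yield cards page by page until a page comes back short or empty.

    fetch_page(payload) must return the decoded JSON dict for one page.
    ASSUMPTION: a page with fewer than page_size cards is the last one;
    max_pages is a safety cap in case that assumption does not hold.
    """
    for page in range(1, max_pages + 1):
        cards = fetch_page(build_payload(page, page_size)).get("Cards") or []
        yield from cards
        if len(cards) < page_size:
            break


# Usage against the real site (requires network access and `requests`):
# import requests
# for card in iter_cards(lambda p: requests.put(API_URL, json=p).json()):
#     print(card["Title"])
```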

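Here is an example that uses selenium to parse this page. It lets you simulate user behavior: wait for the page to load, scroll down to the Locations filter, open the Locations dropdown, select a location (Utah in this example), click it, wait for the new page to load, and extract some information from it: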

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
#chrome_options.add_argument('--no-sandbox')
# Selenium 4 style: the driver path goes through Service and options are passed as `options`
wd = webdriver.Chrome(service=Service('<PATH_TO_CHROME_DRIVER>'), options=chrome_options)

# delay (how long selenium waits for element to be loaded)
DELAY = 30

# maximize browser window
wd.maximize_window()

# load page via selenium
wd.get("https://www.hiltongrandvacations.com/en/resorts-and-destinations")

# wait until the results table has loaded
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Results")]')))

# find locations button, scroll down to it, click it
locations_button = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[contains(text(), "Locations")]')))
wd.execute_script("arguments[0].scrollIntoView();", locations_button)
wd.execute_script("arguments[0].click();", locations_button)

# find utah checkbox, scroll down to it, click it
utah_checkbox = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Utah")]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_checkbox)
wd.execute_script("arguments[0].click();", utah_checkbox)

# find link to utah
utah_link = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//a[@title="Sunrise Lodge, a Hilton Grand Vacations Club Park City, Utah, Revel in the peak of adventure"]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_link)
wd.execute_script("arguments[0].click();", utah_link)

# find description
description = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.CLASS_NAME, 'image-and-intro__description')))

print(description.text)

If selenium alone is not enough, you can also combine it with BeautifulSoup.
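
One way to combine the two, as a rough sketch: let selenium render the page, then hand `wd.page_source` to BeautifulSoup for the actual parsing. It assumes the `wd` driver and the `image-and-intro__description` class from the selenium answer above. `extract_description` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_description(page_source):
    """Return the text of the first element with the description class,
    or None if that class is not present in the rendered HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    node = soup.find(class_="image-and-intro__description")
    if node is None:
        return None
    return node.get_text(" ", strip=True)


# With the driver from the answer above, after the resort page has loaded:
# print(extract_description(wd.page_source))
```

Parsing a snapshot of the rendered HTML this way avoids repeated round trips to the browser when you need many elements from the same page.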

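Thanks Andrej, I'll give it a try, and I appreciate your help. I dug into your solution and saw from some comments that you can find the api_url and payload by looking at the XHR requests on the web page. What I'd like to do is scrape the details of a property once you click its link from the main page (for example). I can't seem to get this information from the API currently specified. Any ideas? I appreciate all the help here.

Try asking another question that describes your current needs, since the question you originally asked has been solved. @Franz

Thank you, Alexandra. @Alexandra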