Python: Beautiful Soup returns an empty string when the website has text
Considering this website: I want to scrape the content under the headers on the right. Below is my sample code, which should return a list of the contents but instead returns an empty string:
import requests as req
from bs4 import BeautifulSoup as bs
r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r)
par = soup.find('h3', text='Facilities')
for sib in par.next_siblings:
    print(sib)
This returns:
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
The website doesn't show any div element with that class, and the list items aren't being captured either.

The facilities and the other information in that frame are loaded dynamically by JavaScript, so bs4 doesn't see them in the source HTML: they simply aren't there.

However, you can query the endpoint that serves the data and get all the information you need.

Here's how:
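Before reaching for the endpoint, you can confirm the diagnosis offline by running the same search against a static copy of what the server actually sends. The HTML string below is a hand-written stand-in for the served page, not the real source:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the server-side HTML: only the empty <ul>
# shell is present; the <li> items are injected later by JavaScript.
static_html = """
<h3>Facilities</h3>
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
"""

soup = BeautifulSoup(static_html, "html.parser")
par = soup.find("h3", string="Facilities")
for sib in par.next_siblings:
    print(sib)  # only the bare <ul> shell; there are no <li> to capture
```

The same test works against the live page: if a facility name such as "Boat Ramp" is absent from `requests.get(...).text`, the content is rendered client-side and no amount of HTML parsing will find it.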
import json
import re
import time
import requests

headers = {
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/90.0.4430.93 Safari/537.36"
    ),
    "referer": "https://dlnr.hawaii.gov/",
}

endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))
print("\n".join(f for f in data["park info"]["facilities"]))
Output:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
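The re.search step above exists because the endpoint returns JSONP rather than bare JSON: the payload arrives wrapped in a callback(...); call. A minimal offline sketch of the unwrapping, using a made-up sample string:

```python
import json
import re

# JSONP wraps the JSON in a function call; capture group 1 is the bare JSON.
jsonp = 'callback({"park info": {"facilities": ["Boat Ramp", "Restroom"]}});'
payload = json.loads(re.search(r"callback\((.*)\);", jsonp).group(1))
print(payload["park info"]["facilities"])  # ['Boat Ramp', 'Restroom']
```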
Here's the entire JSON:
{
"park info": {
"name": "Ahupua\u02bba \u02bbO Kahana State Park",
"id": 57853,
"island": "Oahu",
"activities": [
"Beachgoing",
"Camping",
"Dogs on Leash",
"Fishing",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"Boat Ramp",
"Campsites",
"Picnic table",
"Restroom",
"Showers",
"Trash Cans",
"Water Fountain"
],
"prohibited": [
"No Motorized Vehicles/ATV's",
"No Alcoholic Beverages",
"No Open Fires",
"No Smoking",
"No Commercial Activities"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.556086,
"longitude": -157.875579
},
"hiking": [
{
"name": "Nakoa Trail",
"id": 17,
"activities": [
"Dogs on Leash",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"No Drinking Water"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [
"Flash Flood"
],
"photos": [],
"location": {
"latitude": 21.551087,
"longitude": -157.881228
},
"has_google_street": false
},
{
"name": "Kapa\u2018ele\u2018ele Trail",
"id": 18,
"activities": [
"Dogs on Leash",
"Hiking",
"Sightseeing"
],
"facilities": [
"No Drinking Water",
"Restroom",
"Trash Cans"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.554744,
"longitude": -157.876601
},
"has_google_street": false
}
]
}
}
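Once unwrapped, the payload is an ordinary nested dict, so the other fields are just as easy to reach. A short sketch against a trimmed, inlined copy of the payload above (trimmed so it runs offline):

```python
# Trimmed copy of the "park info" payload shown above.
data = {
    "park info": {
        "name": "Ahupua\u02bba \u02bbO Kahana State Park",
        "facilities": ["Boat Ramp", "Campsites", "Picnic table"],
        "location": {"latitude": 21.556086, "longitude": -157.875579},
        "hiking": [
            {"name": "Nakoa Trail", "hazards": ["Flash Flood"]},
            {"name": "Kapa\u2018ele\u2018ele Trail", "hazards": []},
        ],
    }
}

info = data["park info"]
print(info["location"]["latitude"], info["location"]["longitude"])
for trail in info["hiking"]:
    print(trail["name"], "-", ", ".join(trail["hazards"]) or "no listed hazards")
```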
You've already got the answer you needed, but I thought I'd offer some insight into another way you could have anticipated what was going on (besides looking at the network traffic).

Let's start from your observation: the list items are not being captured. Inspecting each li element, we can see the HTML takes the form

class="parkicon facilities icon01"

where 01 is a variable denoting the particular icon visible on the page. A quick search through the relevant source files shows that these numbers and their corresponding facility references are listed in https://dlnr.hawaii.gov/dsp/wp-content/themes/hic_state_template_StateParks/js/icon.js:
var w_fac_icons = {"ADA Accessible":"01","Boat Ramp":"02","Campsites":"03","Food Concession":"04","Lodging":"05","No Drinking Water":"06","Picnic Pavilion":"07","Picnic table":"08","Pier Fishing":"09","Restroom":"10","Showers":"11","Trash Cans":"12","Walking Path":"13","Water Fountain":"14","Gift Shop":"15","Scenic Viewpoint":"16"}
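If you did want to recover facility names directly from the rendered markup, this mapping can be inverted. The helper below is a hypothetical sketch (only a subset of the mapping is reproduced, with labels as translated above):

```python
# Subset of the w_fac_icons mapping from icon.js (labels are assumptions).
w_fac_icons = {
    "Boat Ramp": "02",
    "Campsites": "03",
    "Picnic table": "08",
    "Restroom": "10",
    "Water Fountain": "14",
}
icon_to_facility = {num: name for name, num in w_fac_icons.items()}

def facility_from_class(css_class):
    # e.g. "parkicon facilities icon08" -> "08" -> "Picnic table"
    num = css_class.rsplit("icon", 1)[-1]
    return icon_to_facility.get(num)

print(facility_from_class("parkicon facilities icon08"))  # Picnic table
```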
If you then search the source HTML for w_fac_icons, you'll find where it is used (lines 560-582). Tracing back the function parkinfo from there, you arrive at line 446, where you'll find the ajax request that dynamically fetches the JSON data used to update the webpage:
function parkinfo() {
    var campID = 57853;
    jQuery.ajax({
        type: 'GET',
        url: 'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
        data: "parkId=" + campID,
data can be passed in as a parameter within the query string of a GET request.
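In requests, the same GET can be expressed with a params dict; preparing the request (without sending it) shows the parameter landing in the query string:

```python
import requests

# Build, but don't send, the same GET the page's ajax call performs;
# requests encodes the params dict into the query string.
req = requests.Request(
    "GET",
    "https://stateparksadmin.ehawaii.gov/camping/park-site.json",
    params={"parkId": 57853},
).prepare()
print(req.url)  # ...park-site.json?parkId=57853
```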
This, then, is the request to look for in the Network tab.

Thanks for the reply! I'm doing this with Scrapy, crawling across multiple pages. Do you know how it could be done with Scrapy?

No, sorry, I don't use Scrapy. Your original attempt in the post above didn't indicate anything about Scrapy; once your question has been answered, you shouldn't introduce new requirements. You can always create a new post describing any new problem, @maverick.

That's what I figured. Well done, +1.