Python: separating elements that share the same tag and have no class or id attributes
I want to get the number of bedrooms, the number of bathrooms and the land area separately for each property on a real estate web page. However, I found that they all use the same tag, <strong>, with no class and no id. So when I write the following code:
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")
# Every <strong> on the page that has neither a class nor an id
rooms = content.find_all('strong', class_=False, id=False)
for room in rooms:
    print(room.text)
I get the following output:
Sign up
2
2
2
2
3
2
4
3
2.4ha
2
1
2
2
4
3
465m2
1
1
3
2
1
1
5
3
10.1ha
3
2
5
5
600m2
600m2
4
2
138m2
2
1
2
1
2
2
3
2
675m2
2
1
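As you can see, the values all come out mixed together because they share the same tag. Can anyone help me separate them? Thanks.

Find the main tile, that is, the div tag that holds the information for one property. Some data, such as the area or the number of bathrooms, is missing for some listings, so you can try this approach: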
from bs4 import BeautifulSoup
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")

# Each property sits in its own <div data-test="tile">
rooms = content.find_all('div', attrs={'data-test': "tile"})
for room in rooms:
    # Only the class-less <strong> tags inside this one tile
    apart = room.find_all('strong', class_=False)
    # Start from defaults so values never leak over from the previous tile
    dict1 = {'bedroom': "NA", 'bathroom': "NA", 'area': "NA"}
    if len(apart) == 3:
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
        dict1['area'] = apart[2].text
    elif len(apart) == 2:
        # Assumes the missing value is the area
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
    elif apart:
        # Assumes that a lone value is the area
        dict1['area'] = apart[0].text
    print(dict1)
Output:
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '3', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '4', 'bathroom': '3', 'area': '2.4ha'}
{'bedroom': '2', 'bathroom': '1', 'area': 'NA'}
...
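Why scoping the search to one tile separates the values can be checked offline. The following is a minimal sketch on a hand-written HTML fragment: the markup, the `SAMPLE_HTML` name and the `strong_per_tile` helper are assumptions made for illustration and do not reproduce the site's real page structure. It uses the built-in `html.parser` so that `lxml` is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two listings; the second one has no land area,
# and its price uses a <strong> that does carry a class.
SAMPLE_HTML = """
<div data-test="tile"><strong>4</strong><strong>3</strong><strong>2.4ha</strong></div>
<div data-test="tile"><strong>2</strong><strong>1</strong><strong class="price">$1</strong></div>
"""


def strong_per_tile(html):
    """Return one list of class-less <strong> texts per tile."""
    soup = BeautifulSoup(html, "html.parser")
    tiles = soup.find_all("div", attrs={"data-test": "tile"})
    return [
        [s.get_text(strip=True) for s in tile.find_all("strong", class_=False)]
        for tile in tiles
    ]


groups = strong_per_tile(SAMPLE_HTML)
print(groups)  # [['4', '3', '2.4ha'], ['2', '1']]
```

Because each inner list belongs to exactly one tile, a missing value shortens only that listing's list and no longer shifts the values of the listings that follow.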
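I would loop over the main tiles and select each target node within a tile, for example by a class that is unique within that tile's HTML. You can use an if-else with an `is not None` test to supply a default value wherever something is missing. To handle different sort orders I also added a try-except. I used sort by latest, but also tested with your sort order.

I added a few more fields to provide context. This could easily be extended to loop over pages, but that is beyond the scope of your question, and it would be a candidate for a new question once you try to extend it (if needed).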
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

#'https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1'
r = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1',
                 headers = {'User-Agent':'Mozilla/5.0'}).text
soup = bs(r, 'lxml')
main_listings = soup.select('.listing-tile')
base = 'https://www.realestate.co.nz/4016546/residential/sale/'
results = {}

for listing in main_listings:
    # The date markup differs between sort orders, so fall back to the plain text
    try:
        date = listing.select_one('.listed-date > span').next_sibling.strip()
    except (AttributeError, TypeError):
        date = listing.select_one('.listed-date').text.strip()
    title = listing.select_one('h3').text.strip()
    listing_id = listing.select_one('a')['id']
    url = base + listing_id
    # '.icon-x + strong' picks the <strong> that directly follows the icon
    bedrooms = listing.select_one('.icon-bedroom + strong')
    if bedrooms is not None:
        bedrooms = int(bedrooms.text)
    else:
        bedrooms = np.nan
    bathrooms = listing.select_one('.icon-bathroom + strong')
    if bathrooms is not None:
        bathrooms = int(bathrooms.text)
    else:
        bathrooms = np.nan
    # Leading dot added: without it the selector looks for a tag named icon-land-area
    land_area = listing.select_one('.icon-land-area + strong')
    if land_area is not None:
        land_area = land_area.text
    else:
        land_area = "Not specified"
    price = listing.select_one('.text-right').text
    results[listing_id] = [date, title, url, bedrooms, bathrooms, land_area, price]

df = pd.DataFrame(results).T
df.columns = ['Listing Date', 'Title', 'Url', '#Bedroom', '#Bathrooms', 'Land Area', 'Price']
print(df)
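The icon-based selection with defaults can also be tried without hitting the site. This is only a sketch under assumptions: the HTML fragment is hand-written, the class names simply mirror the selectors used above, and `text_after_icon` is a hypothetical helper that folds the repeated `is not None` test into one place.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may nest these elements differently.
SAMPLE_HTML = """
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>3</strong>
  <span class="icon-bathroom"></span><strong>2</strong>
</div>
<div class="listing-tile">
  <span class="icon-land-area"></span><strong>600m2</strong>
</div>
"""


def text_after_icon(listing, icon_class, default=None):
    """Text of the <strong> directly following the icon, or the default."""
    node = listing.select_one(f".{icon_class} + strong")
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "bedrooms": text_after_icon(tile, "icon-bedroom"),
        "bathrooms": text_after_icon(tile, "icon-bathroom"),
        "land_area": text_after_icon(tile, "icon-land-area", "Not specified"),
    }
    for tile in soup.select(".listing-tile")
]
for row in rows:
    print(row)
```

Selecting by the icon class rather than by position means a value is never mistaken for another one when a listing omits a field, which is the weakness of counting `<strong>` tags.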
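Can you share a bit of the HTML? All of this is probably inside a div; try targeting that.
Thank you for this wonderful solution!! At this stage I find your code a little hard to digest, because I am new to web scraping (my Python is not too bad). I appreciate your time and effort!
You are very welcome, and thank you for taking the time to comment.
Hi Bhavya, I really appreciate the time and effort you spent on my problem. It works and is easy to understand!
Oh great, you accepted my answer, thank you!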