Python scraper advice
I have been working on this scraper for a while and am now very close to getting it to run as intended. My code is as follows:
import urllib.request
from bs4 import BeautifulSoup

# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/') # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)

# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')

# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all of the prices for all of the URLs, but it returns no names or rooms, and I can't work out why. I would really appreciate any advice on this, or ways to improve my code; I have been learning Python for a few months.

I think the links you are scraping eventually redirect you to another website, in which case your scraping functions are no longer useful! For example, the link for a room in Birmingham redirects you to a different site.

Also, be careful when using the find and find_all methods in BS. The first returns only a single tag (useful when you want one property name), while find_all returns a list, allowing you to get, for example, several room prices and types.

Anyway, I simplified your code, and this is how I ran into your problem. Maybe you want to take some inspiration from it:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.prodigy-living.co.uk/"

# Getting individual cities url
res = requests.get(main_url)
soup = BeautifulSoup(res.text, "html.parser")
city_tags = soup.find("div", class_="footer-city-nav")  # Bottom of page, not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")]  # Links to cities

# Getting the individual links to the apts
indiv_apts = []

for link in cities_links[0:4]:
    print("At link:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "html.parser")
    links_tags = soup.find_all("a", class_="btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url + url.get("href"))

# Now defining your functions
def GetName(tag):
    print(tag.find("h1").get_text())

def GetType_Price(tags_list):
    for tag in tags_list:
        print(tag.find("h5").get_text())
        print(tag.find("p", class_="room-price").get_text())

# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print("At link:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "html.parser")
    property_tag = soup.find("div", class_="property-cta")
    rooms_tags = soup.find_all("div", class_="featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see with the second element of the list that you get an AttributeError, because you are no longer on your site's page. Indeed:
>>> print(indiv_apts[1])
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
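One way to survive those off-site pages without crashing is to catch the AttributeError that `.find()` raises when it returns None. A minimal offline sketch (the helper name `get_name` and the HTML snippets are my own, for illustration):

```python
from bs4 import BeautifulSoup

def get_name(soup):
    """Return the property name, or None when the page has no
    'property-cta' block (e.g. after a redirect to another site)."""
    try:
        return soup.find("div", class_="property-cta").find("h1").get_text()
    except AttributeError:  # soup.find() returned None
        return None

good = BeautifulSoup('<div class="property-cta"><h1>Penworks House</h1></div>', "html.parser")
bad = BeautifulSoup("<p>some other site's markup</p>", "html.parser")
print(get_name(good))  # Penworks House
print(get_name(bad))   # None
```

Returning None instead of printing also lets the caller decide whether to skip the page or log it.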
Next time, come with a precise problem to solve; otherwise, for this kind of request, have a look at the Code Review section instead.
Regarding find and find_all:
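The difference can be shown offline in a few lines (the HTML snippet below is made up for demonstration):

```python
from bs4 import BeautifulSoup

# A made-up snippet with two rooms, mirroring the site's structure
html = """
<div class="featured-item-inner"><h5>Bronze Studio</h5></div>
<div class="featured-item-inner"><h5>Silver Studio</h5></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the FIRST matching tag (or None if nothing matches)
first = soup.find("div", class_="featured-item-inner")
print(first.h5.get_text())     # Bronze Studio

# find_all returns a list of ALL matching tags
rooms = soup.find_all("div", class_="featured-item-inner")
print(len(rooms))              # 2
print(rooms[1].h5.get_text())  # Silver Studio
```

This is why the original getPropNames worked with a loop over findAll but only ever printed one name per page, while the room/price loops needed the full list.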
Finally, I guess this also answers your question:
Cheers :)

Comments:

- For web crawling in Python I strongly recommend… What does it return? Also, it completely depends on the site you are crawling; without sharing that, we cannot know whether what you are parsing is correct. @ryugie
- It only returns prices. Sorry, I have edited the question, if you don't mind taking a look @ryugie
- Thank you, I wasn't expecting anyone to rewrite it, but this works great! I added a try and except, and it runs as expected.
- Glad to help :) I always briefly check what's in my lists before iterating over them.
- Yes, I noticed before posting that some links redirect to another site. I would like to find a way to add them to a separate list. Thanks again!
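Since the last comment asks about collecting the redirecting links into a separate list, one way is to compare each href's domain against the site's before concatenating. A sketch, where `split_links` and the sample hrefs are my own illustration:

```python
from urllib.parse import urlparse

main_url = "https://www.prodigy-living.co.uk/"

def split_links(hrefs):
    """Sort hrefs into same-site links and links that point elsewhere."""
    internal, external = [], []
    site = urlparse(main_url).netloc
    for href in hrefs:
        netloc = urlparse(href).netloc
        if netloc in ("", site):
            # Relative links get the site prefix; absolute same-site links stay as-is
            internal.append(main_url + href.lstrip("/") if netloc == "" else href)
        else:
            external.append(href)
    return internal, external

# Hypothetical hrefs covering both cases
internal, external = split_links([
    "student-accommodation/birmingham",
    "http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house",
])
print(internal)  # ['https://www.prodigy-living.co.uk/student-accommodation/birmingham']
print(external)  # ['http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house']
```

Running the hrefs through a check like this before building `indiv_apts` avoids the malformed "two URLs glued together" links shown above.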