Python: KeyError: 'href' when scraping with Beautiful Soup
I am getting KeyError: 'href'. I think this is because my attribute is not defined. I have tried to find a solution, but so far without success. My code is as follows:
import requests
from bs4 import BeautifulSoup
main_url = "https://www.chapter-living.com/properties/highbury/"
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('h2', class_="title") # The section containing the links to the cities
cities_links = [main_url + tag['href'] for tag in city_tags] # Iterates through city_tags and stores them in a [list]
The error is raised when cities_links is evaluated.
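The h2 tag has no href attribute. That attribute belongs to the a tag nested inside it. This is why you get the error: you are trying to access an attribute that does not exist. Look up the a tag inside each h2 and read href from it instead: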
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/properties/highbury"
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('h2', class_="title")
cities_links = [main_url + tag.find('a').get('href','') if tag.find('a') else '' for tag in city_tags]
print(cities_links)
This results in:
[u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-en-suite/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-en-suite/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-premium-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/gold-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/platinum-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/two-bed-flat/', '', '', '', '', '', '']
Alternatively, you can use the lxml module, which can be about an order of magnitude faster than BeautifulSoup:
import requests
from lxml import html
main_url = "http://www.chapter-living.com/properties/highbury"
re = requests.get(main_url)
root = html.fromstring(re.content)
cities_links = [main_url + link for link in root.xpath('//h2[@class="title"]/a/@href')]
print(cities_links)
This results in:
['http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-en-suite/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-en-suite/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-premium-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/gold-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/platinum-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/two-bed-flat/']
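Note that in the output above the path properties/highbury appears twice in every link. That suggests the href values on the page are already root-relative, so plain string concatenation with main_url duplicates the path. A minimal sketch of one way around this, assuming Python 3 and assuming href values of the form inferred from the printed output (they are not copied from the live page), is to resolve each link with urljoin from the standard library:

```python
from urllib.parse import urljoin

main_url = "http://www.chapter-living.com/properties/highbury"

# Assumed href values, inferred from the printed output above;
# they are illustrative and not fetched from the site.
hrefs = [
    "/properties/highbury/rooms/bronze-en-suite/",
    "/properties/highbury/rooms/silver-en-suite/",
]

# String concatenation repeats the path that the href already contains.
concatenated = [main_url + href for href in hrefs]

# urljoin resolves each href against the base URL the way a browser would.
joined = [urljoin(main_url, href) for href in hrefs]

print(concatenated[0])
print(joined[0])
```

urljoin also handles fully qualified and page-relative href values, so it is usually a safer default than concatenation when the shape of the links is not known in advance.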
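Thanks, much appreciated!
I'm not sure I agree with that. Where would you say they are stored? I would say they are stored in the a tags. That is why the answer above calls tag.find('a'): the h2 tags have no href attribute.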
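The point made in this thread can be reproduced offline. The sketch below uses a small hypothetical HTML snippet, modelled on the h2/a structure described above rather than taken from the real page, to show that indexing the h2 tag raises KeyError, while .get() returns None and the nested a tag carries the href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading wraps a link, the other does not.
html = """
<h2 class="title"><a href="/rooms/bronze-en-suite/">Bronze En-suite</a></h2>
<h2 class="title">Heading without a link</h2>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2", class_="title")
first = headings[0]

# tag["href"] behaves like a dict lookup on the tag's own attributes,
# so it raises KeyError when the h2 itself has no href.
try:
    first["href"]
    raised = False
except KeyError:
    raised = True

# tag.get("href") returns None instead of raising.
h2_href = first.get("href")

# The href lives on the nested a tag.
a_href = first.find("a")["href"]

# Collect hrefs only from headings that actually contain a link.
hrefs = []
for heading in headings:
    link = heading.find("a")
    if link is not None and link.has_attr("href"):
        hrefs.append(link["href"])

print(raised, h2_href, a_href, hrefs)
```

Skipping headings without a link, as the loop does here, also avoids the empty strings that appear at the end of the BeautifulSoup output in the answer.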