Python BeautifulSoup and scraping hrefs isn't working
I'm running into trouble scraping hrefs with BeautifulSoup again. I have a list of pages I'm scraping, and I'm getting the data, but I can't seem to get the hrefs even though I'm using various code that works in my other scripts. Here is the code, and my data is below:
import requests
from bs4 import BeautifulSoup

with open('states_names.csv', 'r') as reader:
    states = [state.strip().replace(' ', '-') for state in reader]

url = 'https://www.hauntedplaces.org/state/'

for state in states:
    page = requests.get(url + state)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.findAll('div', class_='description')
    # When I try to add .get('href') I get a traceback error.
    # Am I trying to scrape the href too early?
    h_page = soup.findAll('h3')
<h3><a href="https://www.hauntedplaces.org/item/gaines-ridge-dinner-club/">Gaines Ridge Dinner Club</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/purifoy-lipscomb-house/">Purifoy-Lipscomb House</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/kate-shepard-house-bed-and-breakfast/">Kate Shepard House Bed and Breakfast</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/cedarhurst-mansion/">Cedarhurst Mansion</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/crybaby-bridge/">Crybaby Bridge</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/gaineswood-plantation/">Gaineswood Plantation</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/mountain-view-hospital/">Mountain View Hospital</a></h3>
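The traceback comes from where `.get('href')` is being called: `findAll` returns a ResultSet, and the `description` divs themselves carry no `href` attribute. The href lives on the `<a>` tag inside each `<h3>`. A minimal sketch of that distinction, using an inline copy of the HTML above instead of a live request:

```python
from bs4 import BeautifulSoup

# Inline sample mirroring the <h3><a href="..."> structure shown above
html = '''
<h3><a href="https://www.hauntedplaces.org/item/gaines-ridge-dinner-club/">Gaines Ridge Dinner Club</a></h3>
<h3><a href="https://www.hauntedplaces.org/item/crybaby-bridge/">Crybaby Bridge</a></h3>
'''
soup = BeautifulSoup(html, 'html.parser')

# .get('href') must be called on each <a> tag, not on the ResultSet
# and not on a parent element that has no href attribute.
hrefs = [h3.a.get('href') for h3 in soup.find_all('h3') if h3.a]
print(hrefs)
```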
Try this:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])
        list0.append(link.attrs['href'])
print(list0)
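The `has_attr` check above can also be folded into the query itself: `find_all` accepts `href=True`, which returns only the anchors that actually carry an href. A short sketch of this alternative (same result, just more compact):

```python
from bs4 import BeautifulSoup

# Three anchors; the middle one has no href and should be skipped
html = '<a href="/item/a/">A</a><a name="anchor-only">B</a><a href="/item/b/">C</a>'
soup = BeautifulSoup(html, 'html.parser')

# href=True filters to <a> tags that have an href attribute
list0 = [link['href'] for link in soup.find_all('a', href=True)]
print(list0)
```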
This works great:
from bs4 import BeautifulSoup
import requests

url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
    print(link['href'])
Do you need all the links on the page, or just the ones in the descriptions? — All the links in the descriptions. — See my answer, updated to use lxml. — Is lxml what allows the select method 'div.description a'? — No, 'html.parser' allows it too; lxml is just better and faster. — It works great, thank you! — Glad I could help.
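As the comments note, `.select('div.description a')` is not tied to the lxml parser: CSS selection in BeautifulSoup works with the stdlib `html.parser` as well; lxml is simply faster at parsing. A quick check with the stdlib parser (hypothetical inline HTML, no network):

```python
from bs4 import BeautifulSoup

html = '<div class="description"><h3><a href="/item/x/">X</a></h3></div>'

# The same CSS selector works with the stdlib parser
anchors = BeautifulSoup(html, 'html.parser').select('div.description a')
print([a['href'] for a in anchors])
```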