Python urllib opener raises HTTPError when trying to extract URLs
I'm trying to build a script that scrapes Craigslist for Mazda Miatas. I run into an error when the function `extract_post_urls` makes its request. Here is the tutorial I was trying to follow, and here is the code so far:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import urllib.request
class CraigslistScaper(object):
    def __init__(self, query, location, max_price, transmission):
        self.query = query
        # self.sort = sort
        self.location = location
        # self.postal = postal
        self.max_price = max_price
        self.transmission = auto_transmission
        # https://sfbay.craigslist.org/search/cta?query=mazda+miata&sort=rel&max_price=6000&auto_transmission=1
        self.url = f"https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"
        self.driver = webdriver.Chrome('/Users/MyLaptop/Desktop/chromedriver')
        self.delay = 5

    def load_craigslist_url(self):
        self.driver.get(self.url)
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("page is ready")
        except TimeoutError:
            print('Loading took too much time')

    def extract_post_titles(self):
        all_posts = self.driver.find_elements_by_class_name('result-row')
        post_titles_list = []
        for post in all_posts:
            print(post.text)
            post_titles_list.append(post.text)

    def extract_post_urls(self):
        url_list = []
        # req = Request(self.url)
        html_page = urllib.request.urlopen(self.url)
        soup = BeautifulSoup(html_page, 'lxml')
        for link in soup.findAll("a ", {"class": "result-title hrdlnk"}):
            print(link["href"])
            url_list.append(["href"])
        return url_list

    def quit(self):
        self.driver.close()
location = "sfbay"
#postal = "94519"
max_price = "5000"
#radius = "250"
auto_transmission = 1
query = "Mazda Miata"
scraper = CraigslistScaper(query,location,max_price,auto_transmission)
scraper.load_craigslist_url()
scraper.extract_post_titles()
scraper.extract_post_urls()
scraper.quit()
Here is the error I get:
File "<ipython-input-2-edb38e647dc0>", line 1, in <module>
runfile('/Users/MyLaptop/.spyder-py3/CraigslistScraper', wdir='/Users/MohitAsthana/.spyder-py3')
File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/MyLaptop/.spyder-py3/CraigslistScraper", line 73, in <module>
scraper.extract_post_urls()
File "/Users/MyLaptop/.spyder-py3/CraigslistScraper", line 52, in extract_post_urls
html_page = urllib.request.urlopen(req)
File "/anaconda3/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/anaconda3/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/anaconda3/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Bad Request
Chrome opens the correct URL, but it errors out when urllib downloads the page. The problem is with this line:
self.url = f"https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"
`"https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"` is not a valid URL - at what point are you substituting your values (e.g. `self.transmission`) into that string? Try:
self.url = "https://{}.craigslist.org/search/cta?query={}&sort=rel&max_price={}&auto_transmission={}".format(self.location, self.query, self.max_price, self.transmission)
See if that helps. If not, print the URL instead of requesting it. Can you post the whole error? I replaced the `self.url` line with the line you typed and it still raises the HTTPError: Bad Request. I create the variables before running the code ("query = Mazda Miata" and so on), so I assumed they would be substituted. As for the f"..." syntax, that's what the tutorial does; I was just following it. @DanielM I see, that's a nice new feature. Your query contains a space. Either URL-encode it or replace it with `+`, e.g.
query=“Mazda+Miata”
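As a more general fix than hand-replacing spaces, the query parameters can be encoded with `urllib.parse.urlencode`, which escapes spaces and other special characters automatically. A minimal sketch (the parameter names mirror the Craigslist URL used in the question):

```python
from urllib.parse import urlencode

# Build the search URL with all parameters safely percent/plus-encoded.
params = {
    "query": "Mazda Miata",      # the space is encoded as '+' by urlencode
    "sort": "rel",
    "max_price": 5000,
    "auto_transmission": 1,
}
url = "https://sfbay.craigslist.org/search/cta?" + urlencode(params)
print(url)
# query=Mazda+Miata&sort=rel&max_price=5000&auto_transmission=1
```

Passing this `url` to `urllib.request.urlopen` avoids the Bad Request, since the raw space in `"Mazda Miata"` is what made the original URL invalid.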