How do I use requests for web scraping in Python?
This is my first attempt at web scraping and I am following a tutorial. The code I have so far is:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
However, I get this error:
Traceback (most recent call last):
File "/Users/alanwen/Desktop/webscrape.py", line 4, in <module>
source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')
File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.usnews.com', port=443): Read timed out. (read timeout=None)
[Finished in 25.1s with exit code 1]
[shell_cmd: python -u "/Users/alanwen/Desktop/webscrape.py"]
[dir: /Users/alanwen/Desktop]
[path: /Library/Frameworks/Python.framework/Versions/3.8/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/opt/X11/bin:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands]
You have to add .text at the end of the requests line to get the actual source code of the web page. Instead of
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
do
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
If you get the error
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
install the lxml module via pip.
This page needs a User-Agent header to recognize the browser. It can even be an incomplete Mozilla/5.0, but requests usually sends python-requests/2.23.0, and without a correct header this server blocks the connection. After some time you then get the 'timed out' message, because requests can no longer wait for data from the server.
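That also explains the traceback above: the default read timeout in requests is None, so the script waited until the server gave up. A minimal sketch of passing an explicit timeout so a blocked request fails fast (the 10-second value and the httpbin test URL are arbitrary choices for illustration):
import requests
try:
    # give the server 10 seconds to respond instead of waiting indefinitely
    r = requests.get('https://httpbin.org/delay/3', timeout=10)
    print(r.status_code)
except requests.exceptions.ReadTimeout:
    print('server did not send data within 10 seconds')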
BTW: BeautifulSoup needs source.text or source.content, not source itself (which is the requests response object).
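To see the difference, here is a quick check of the types involved under Python 3 (httpbin.org is used only as a convenient test URL):
import requests
r = requests.get('https://httpbin.org/get')
print(type(r))          # <class 'requests.models.Response'>
print(type(r.text))     # <class 'str'>  - decoded text
print(type(r.content))  # <class 'bytes'> - raw bytes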
Working code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.usnews.com/best-colleges/rankings/national-universities'
# even a minimal User-Agent is enough for this server to accept the request
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())
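From there you can start pulling data out of the parsed tree. A minimal sketch; the h3 tag here is only an assumption, and the real selectors depend on the page's actual HTML:
# hypothetical: list the text of every <h3> on the page
for h in soup.find_all('h3'):
    print(h.get_text(strip=True))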
BTW: with the page https://httpbin.org you can check what you send to the server:
import requests
r = requests.get('https://httpbin.org/get')
#r = requests.post('https://httpbin.org/post')
#r = requests.get('https://httpbin.org/ip')
#r = requests.get('https://httpbin.org/user-agent')
print( r.text )
#print( r.content )
#print( r.json() )
You can also check it in the response (if you use the url with /get or /post); in the JSON below you will see the User-Agent:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5f07c942-067f5b72784a207b31e76ce4"
},
"origin": "83.23.22.221",
"url": "https://httpbin.org/get"
}
Or you can check it in the request object:
print( r.request.headers['User-Agent'] )
python-requests/2.23.0
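Putting the two together, you can confirm that a custom User-Agent actually goes out; a small check against httpbin.org:
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://httpbin.org/user-agent', headers=headers)
print(r.json())  # {'user-agent': 'Mozilla/5.0'}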
It may be a good idea to find another URL, because this one is a bit 'huge'; a retail or furniture store would be a good example, where you just get the price, image, and name to start. Oh, and if you are just learning and not limited to Python 2, switch to Python 3! There are plenty of Python 3 guides for web scraping.
This website is a bad choice for a first attempt at web scraping, because it is not a normal site to scrape; I think you will have to use Selenium for it:
from selenium import webdriver
from bs4 import BeautifulSoup
# path to the chromedriver executable (Selenium 3 style: passed directly)
chromedriver = "driver/chromedriver"
driver = webdriver.Chrome(chromedriver)
url = 'https://www.usnews.com/best-colleges/rankings/national-universities'
driver.get(url)
# page_source contains the HTML after JavaScript has run
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
driver.quit()
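Because this page builds its content with JavaScript, page_source may be captured before the rankings have rendered. A sketch using Selenium's explicit waits; waiting for an h3 tag is only an assumption about what marks the loaded content:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 30 seconds for at least one <h3> to appear before reading the page
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h3'))
)
source = driver.page_source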