How do I use requests for web scraping in Python?

This is my first attempt at web scraping and I am following a tutorial. The code I have so far is:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())
However, I get this error:


Traceback (most recent call last):
  File "/Users/alanwen/Desktop/webscrape.py", line 4, in <module>
    source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')
  File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/Users/alanwen/Library/Python/2.7/lib/python/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.usnews.com', port=443): Read timed out. (read timeout=None)
[Finished in 25.1s with exit code 1]
[shell_cmd: python -u "/Users/alanwen/Desktop/webscrape.py"]
[dir: /Users/alanwen/Desktop]
[path: /Library/Frameworks/Python.framework/Versions/3.8/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/opt/X11/bin:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands]


You have to add .text at the end of the requests line to get the actual source of the web page. Instead of:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())
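
use something like this (the corrected block is reconstructed here from the answer's advice to add .text, so that BeautifulSoup receives the HTML string instead of the Response object):

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.usnews.com/best-colleges/rankings/national-universities')

# .text gives the response body as a string, which is what BeautifulSoup expects
soup = BeautifulSoup(source.text, 'lxml')

print(soup.prettify())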


If you get the error

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

then install the lxml module with pip.

This page needs a User-Agent header to recognize the browser. It can even be an incomplete one such as Mozilla/5.0, but requests normally sends python-requests/2.23.0. Without the correct header this server blocks the connection, and after some time you can get the "timed out" message, because requests cannot wait any longer for data from the server.

BTW: BeautifulSoup needs source.text or source.content, not source (i.e. the requests Response object).
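As an aside (my addition, not part of the original answers): if you would rather not install lxml, BeautifulSoup's built-in html.parser works as well, just a bit more slowly:

# html.parser ships with the standard library, so no extra install is needed
soup = BeautifulSoup(source.text, 'html.parser')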


Working code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.usnews.com/best-colleges/rankings/national-universities'
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

print(soup.prettify())
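
Since the original error was a ReadTimeout (read timeout=None means requests was willing to wait forever), you can also pass an explicit timeout so the call fails after a fixed number of seconds instead of hanging; the 10-second value below is just an illustrative choice:

import requests

url = 'https://www.usnews.com/best-colleges/rankings/national-universities'
headers = {'User-Agent': 'Mozilla/5.0'}

# raise requests.exceptions.Timeout if the server does not respond within 10 seconds
r = requests.get(url, headers=headers, timeout=10)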

BTW: using the page https://httpbin.org you can check what you are sending to the server:

import requests

r = requests.get('https://httpbin.org/get')
#r = requests.post('https://httpbin.org/post')
#r = requests.get('https://httpbin.org/ip')
#r = requests.get('https://httpbin.org/user-agent')

print( r.text )
#print( r.content )
#print( r.json() )
If you use the url with /get or /post, you can also check the response body; it echoes back the headers that were sent, including the User-Agent:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-5f07c942-067f5b72784a207b31e76ce4"
  }, 
  "origin": "83.23.22.221", 
  "url": "https://httpbin.org/get"
}

Or you can check the request object directly:

print( r.request.headers['User-Agent'] )

and you will see the User-Agent:

python-requests/2.23.0
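
The same value can also be pulled out of the parsed JSON (a small extra illustration, not part of the original answer):

import requests

r = requests.get('https://httpbin.org/get')

# httpbin echoes the request headers back in its JSON body
print( r.json()['headers']['User-Agent'] )   # e.g. python-requests/2.23.0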

It might be a good idea to find another url, because this one is a bit 'huge'; a retail or furniture store would be a good example. If you are just learning and are not tied to Python 2, just grab the prices, images and names to start. Oh no, switch to Python 3! There are plenty of Python 3 guides for web scraping.

This website is a bad choice for a first attempt at web scraping, because it is not a normal site to scrape, I think; you have to use Selenium for it:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import os
import time


# path to the chromedriver executable
chromedriver = "driver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

url = 'https://www.usnews.com/best-colleges/rankings/national-universities'

# let the real browser render the page, then hand the resulting HTML to BeautifulSoup
driver.get(url)
source = driver.page_source

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())
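
If you do not want a visible browser window to open, a headless variant is possible. This is only a sketch: it assumes a Selenium 3 / Chrome setup and reuses the same chromedriver path as above.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without opening a window (the chromedriver path below is the same assumption as in the answer)
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome("driver/chromedriver", options=options)
driver.get('https://www.usnews.com/best-colleges/rankings/national-universities')
print(driver.page_source[:500])  # first 500 characters of the rendered HTML
driver.quit()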