I need to fetch news article data. I'm using requests.get from Python, but I get this error: 403 Forbidden
Here is my code:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
url = 'https://business.inquirer.net/category/latest-stories/page/10'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
This is the result I get:
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
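I have read that adding a header can fix this error. I tried adding a header that I copied from DevTools while inspecting the site, but it did not solve my problem.
Please help me.
You are not using the headers variable anywhere, so it is never passed with the request. You can use the following code to do that: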
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
siteurl = "https://business.inquirer.net/category/latest-stories/page/10"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(siteurl,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
print(soup)
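The urllib approach above can also be wrapped so that a 403 is reported instead of ending in a traceback. This is only a minimal sketch: the names build_request and fetch_html, the SITE_URL constant, and the 10-second timeout are assumptions made for illustration, and sending a User-Agent does not guarantee that the site will accept the request.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

SITE_URL = "https://business.inquirer.net/category/latest-stories/page/10"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the decoded page body, or None if the server answers with an HTTP error.

    Hypothetical helper: the name and the timeout value are illustrative choices.
    """
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as page:
            charset = page.headers.get_content_charset() or "utf-8"
            return page.read().decode(charset, errors="replace")
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as 403 Forbidden.
        print(f"Request failed: {err.code} {err.reason}")
        return None
```

Calling fetch_html(SITE_URL) returns the HTML as a string that can be handed to BeautifulSoup, or None when the server still refuses the request.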
When you try to scrape data from this site using BeautifulSoup, the site does not show its data.
When you try this (Python 2):
from bs4 import BeautifulSoup
from urllib import urlopen
url = "https://business.inquirer.net/category/latest-stories/page/10"
open_page = urlopen(url)
source = BeautifulSoup(open_page,"html.parser")
print source
You will see a line like:
<p>The owner of this website (business.inquirer.net) has banned your access based on your browser's signature (4af0dedd3eebcb40-ua48).</p>
Output:
TAXATION
DOF clarifies: Rice tariffication law takes effect on March 5
FEBRUARY 19, 2019 BY: BEN O. DE VERA
BANKS
HSBC reports net profit at $12.6B in 2018
FEBRUARY 19, 2019
CURRENCIES
Asian shares gain on hopes for progress on China-US trade
FEBRUARY 19, 2019
ECONOMY
Amro sees higher PH growth in 2019 on easing inflation, infra boost
FEBRUARY 19, 2019 BY: BEN O. DE VERA
TELECOMMUNICATIONS
Poe to DICT: Stop ‘dilly-dallying’ over 3rd telco project
FEBRUARY 19, 2019 BY: CHRISTIA MARIE RAMOS
SOCIAL SECURITY
SSS contribution collections grow by P22.19B in 2018
FEBRUARY 18, 2019 BY: CHRISTIA MARIE RAMOS
STOCKS
World stocks mixed ahead of further China-US trade talks
FEBRUARY 18, 2019
TRADE
Rice tariffication starts on March 3
FEBRUARY 18, 2019 BY: BEN O. DE VERA
AGRICULTURE/AGRIBUSINESS
NFA-Bohol workers wear black to mourn ‘death of the rice industry’
FEBRUARY 18, 2019 BY: LEO UDTOHAN
BONDS
Treasury: RTBs to be sold to individual investors online in Q1
FEBRUARY 18, 2019 BY: BEN O. DE VERA
This just works for me:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('https://business.inquirer.net/category/latest-stories/page/10')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print(text)
Try including headers; many sites block requests that do not have them:
r = requests.get(url, headers=...)
See the Requests documentation for more information.
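Applied to the code in the question, passing the headers could look like the sketch below. The fetch_html name and the 10-second timeout are assumptions for illustration; the User-Agent string is the one from the question. Whether the site accepts this particular User-Agent is not guaranteed, so the status is checked before the body is used.

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    )
}


def fetch_html(url, headers=HEADERS, timeout=10):
    """Return the page HTML; raise requests.HTTPError on a 4xx/5xx status.

    Hypothetical helper: the name and the timeout value are illustrative choices.
    """
    # The headers dict has to be passed explicitly; otherwise requests
    # sends its own default "python-requests/..." User-Agent.
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
```

The returned string can then be parsed as in the question, for example with BeautifulSoup(fetch_html(url), 'html.parser').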