
Python scraping results in a 403 Forbidden error


I am trying to scrape each company's earnings from the web with BeautifulSoup. However, the site seems to detect that a web scraper is being used, and I get an "HTTP Error 403: Forbidden".

The page I am trying to scrape is:


Does anyone know how to get around this?

You should try setting the User-Agent as one of the request headers. The value can be that of any known browser.

For example:


Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
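
As a minimal sketch, that header might be passed with the requests library like so (the URL is the earnings page scraped later in this answer; whether this alone avoids the 403 depends on the site):

import requests

# Sketch: send the browser User-Agent shown above with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/63.0.3239.132 Safari/537.36'}
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', headers=headers)
print(r.status_code)  # 200 means the 403 was avoided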

I was able to access the site's content by using a proxy, which can be found here:

Then, creating the payload with the requests module, you can scrape the site:

import requests
import re
from bs4 import BeautifulSoup as soup

# Fetch the earnings page through the proxy to get around the 403
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', proxies={'http': '50.207.31.221:80'}).text
# Pull every "Revenue of $..." figure straight out of the raw HTML
results = re.findall(r'Revenue of \$[a-zA-Z0-9\.]+', r)
s = soup(r, 'lxml')
# Quarter labels, e.g. "Q4: 11-16-17"
titles = list(map(lambda x: x.text, s.find_all('span', {'class': 'title-period'})))
# EPS summaries, e.g. "EPS of $0.93 beat by $0.02"
epas = list(map(lambda x: x.text, s.find_all('span', {'class': 'eps'})))
# Beat/miss indicators are rendered with "green" or "red" classes
deciding = list(map(lambda x: x.text, s.find_all('span', {'class': re.compile('green|red')})))
# One row per quarter: [title, EPS, revenue, EPS]
results = list(map(list, zip(titles, epas, results, epas)))
Output:

[[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'], [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'], [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'], [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'], [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'], [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'], [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'], [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'], [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '], [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '], [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'], [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '], [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '], [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'], [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '], [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
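
As a follow-up sketch, each zipped row above can be turned into a labeled record, which makes the quarter date, EPS, and revenue available as separate fields (each row holds the EPS text twice, so the duplicate is discarded):

# Sketch: unpack the four-element rows built above into labeled fields
for quarter, eps, revenue, _ in results:
    print({'quarter': quarter, 'eps': eps, 'revenue': revenue})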

For anyone using PyQuery:

from pyquery import PyQuery as pq

# Fetch the article page through an HTTP proxy and print its markup
page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand', proxies={'http': '34.231.147.235:8080'})
print(page)
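
Note that when PyQuery is given a URL it fetches the page itself, using the requests library when it is installed, so keyword arguments such as proxies appear to be forwarded to the underlying HTTP request; a separate import of requests is not needed in this snippet.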
  • (The proxy information used comes from)
  • Make sure you are using the requests library, not urllib. Do not try to load the page with urlopen.
Thanks. This solution is quite elegant. I just need to figure out how to get the other information on that page, such as the quarter dates, EPS, and so on.

@user172839 What other information do you need?

I just need all of the column information from the table on that page.

@user172839 The quarters and the EPS figures? For example, from the first row of the table I want "Q4: 11-16-17, EPS of $0.93 beat by $0.02, Revenue of $3.97B (+20.3%) missed by $30M". Is there an easy way to get these out separately? (Sorry, I'm new to Python.) My end goal is to run a large list of companies through this so I can analyze the results.

"Do not try to load the page with urlopen": why not? What would you do instead?