Python 3.x: Beginner at web scraping with Python. Does this website have anti-scraping functionality?

I'm trying to set up an automated daily web scrape, but the results I get are empty lists. I think the site may have some kind of protection against being scraped.

I followed a few tutorials to try scraping the site with BeautifulSoup4 and with XPath, but both approaches leave me with empty lists. At one point I did get a 403 Forbidden error, but I found a workaround using "hdr = {'User-Agent': 'Mozilla/5.0'}" (whatever that means). I'm new to web scraping, so I'm not sure.

The BeautifulSoup4 version returns a result, but not the actual data I'm looking for:
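from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.cmegroup.com/trading/agricultural/dairy/cash-settled-butter_quotes_globex.html"
# Send a browser-like User-Agent; the default one was answered with 403 Forbidden
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=hdr)
page = urlopen(req)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())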
The XPath version seems to connect, but it doesn't return any data:
from lxml import html
import requests
url = "https://www.cmegroup.com/trading/agricultural/dairy/cash-settled-butter_quotes_globex.html"
response = requests.get(url)
tree = html.fromstring(response.content)
data = tree.xpath('//*[@id="quotesFuturesProductTable1"]/tbody/tr[1]/th/span')
data
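I'd like to extract the name, month, and prior settle, and then eventually figure out how to have it pull the data automatically every day.

What am I doing wrong?

The data you see on the page is loaded dynamically via JavaScript. BeautifulSoup can't help you here, because it doesn't execute JavaScript. That is why the table you are selecting is not in the HTML that comes back, and both approaches return empty lists.

You can use selenium, for example, or parse the data manually with the re and json modules. This code finds the data URL in the page source (the component.url assignment), loads the data in JSON format, and prints it to the screen: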
import re
import json
import requests
url = 'https://www.cmegroup.com/trading/agricultural/dairy/cash-settled-butter_quotes_globex.html'
data_url = 'https://www.cmegroup.com' + re.findall(r'component\.url = "(.*?)"', requests.get(url).text)[0]
json_data = requests.get(data_url).json()
print(json.dumps(json_data, indent=4))
Prints:
{
"quoteDelayed": true,
"quoteDelay": "10 minutes",
"tradeDate": "14 Aug 2019",
"quotes": [
{
"last": "235.850",
"change": "+0.800",
"priorSettle": "235.050",
"open": "235.050",
"close": "-",
"high": "235.850",
"low": "235.050",
"highLimit": "241.725",
"lowLimit": "231.725",
"volume": "2",
"mdKey": "CBQ9-XCME-G",
"quoteCode": "CBQ9",
"escapedQuoteCode": "CBQ9",
"code": "CBQ9",
"updated": "11:27:33 CT<br /> 14 Aug 2019",
"percentageChange": "+0.34%",
"expirationMonth": "AUG 2019",
"expirationCode": "Q9",
"expirationDate": "20190801",
"productName": "Cash-settled Butter Futures",
"productCode": "CB",
"uri": "/trading/agricultural/dairy/cash-settled-butter.html",
"productId": 26,
"exchangeCode": "XCME",
"optionUri": "/trading/agricultural/dairy/cash-settled-butter_quotes_options.html",
"hasOption": true,
"lastTradeDate": {
"timestamp": 1567573200000,
"dateOnlyLongFormat": "04 Sep 2019",
"default24": "09/04/2019, 00:00:00 CDT",
"default12": "09/04/2019, 12:00:00 AM CDT",
"verbose": "September 04, 2019 12:00:00 AM CDT"
},
"priceChart": {
"enabled": true,
"code": "CB",
"monthYear": "Q9",
"venue": 1,
"title": "AUG_2019_Cash-settled_Butter_",
"year": 2019
},
"netChangeStatus": "statusOK",
"highLowLimits": "241.725 / 231.725"
},
...and so on.
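
To get just the name, month, and prior settle that the question asks for, the JSON can be reduced to three fields per quote. This is a minimal sketch, assuming the response keeps the shape shown in the output above (a "quotes" list whose items carry "productName", "expirationMonth", and "priorSettle"). The helper name extract_quotes is hypothetical, and the sample dict is cut down from that output so the snippet runs without a network call:

```python
def extract_quotes(json_data):
    """Return (product name, expiration month, prior settle) for each quote."""
    rows = []
    # .get() keeps this from raising if the site changes or omits a key
    for quote in json_data.get("quotes", []):
        rows.append((
            quote.get("productName"),
            quote.get("expirationMonth"),
            quote.get("priorSettle"),
        ))
    return rows


# Sample cut down from the printed output; in practice, pass in the
# json_data returned by requests.get(data_url).json().
sample = {
    "tradeDate": "14 Aug 2019",
    "quotes": [
        {
            "priorSettle": "235.050",
            "expirationMonth": "AUG 2019",
            "productName": "Cash-settled Butter Futures",
        }
    ],
}

for name, month, prior_settle in extract_quotes(sample):
    print(name, month, prior_settle)
```

Note that the values arrive as strings, so convert "priorSettle" with float() if you need to do arithmetic on it.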
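
For the daily part of the question, one simple option is to append each day's values to a CSV file and let the operating system's scheduler (cron on Linux/macOS, Task Scheduler on Windows) run the script once a day. The sketch below covers only the saving step. The function name append_quotes, the column names, and the (name, month, prior settle) row layout are assumptions for illustration, not something the site or the answer defines:

```python
import csv
import os


def append_quotes(path, trade_date, rows):
    """Append one CSV line per quote, writing a header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["trade_date", "name", "month", "prior_settle"])
        for name, month, prior_settle in rows:
            writer.writerow([trade_date, name, month, prior_settle])
```

A call might look like append_quotes("butter.csv", json_data["tradeDate"], rows), where rows is a list of (name, month, prior settle) tuples built from json_data["quotes"]. Storing the trade date on every line keeps the file usable even if a scheduled run is skipped or repeated.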