Python: Scraping a dynamic JavaScript website in Google Colab

python pandas selenium google-colaboratory

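In Google Colab with Python 3, if a website's data is defined as an HTML `<table>`, I can read the JavaScript data from that site as follows:

import requests
import pandas as pd

url = 'https://datatables.net/extensions/buttons/examples/html5/simple.html'
df = pd.read_html(requests.get(url).text)[0]
print(df)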

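However, I would similarly like to read data from a more complex JavaScript site directly into Python 3 in Google Colab. This data is not defined in `<table>` format.

For example, I would like to see which dates are "sold out" and which are not on the following website:

The difference between available dates (blue) and sold-out dates (red) is

I have tried a combination of Selenium, BeautifulSoup, and Pandas in Python 3 on Colab, without success.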

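When you analyze the website, you will notice that it takes time to load. Once it has loaded, though, you can inspect the network calls the site makes. The site issues an AJAX call that loads all the data about sold-out dates: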

import requests
from bs4 import BeautifulSoup

payload = {
    "inventoryPoolCode": 8452,
    "duration": 1,
    "quantity": 1,
    "productDate": "9 August, 2020"
}

headers = {
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8"
}

res = requests.post("https://shop.perisher.com.au/ProductCalendar/Index", data=payload,headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

data = {'soldout':[], 'notsoldout':[]}
for span in soup.find_all("span", class_="grid-item"):
    # Skip cells marked "empty"; sort the rest by the "soldout" class.
    if "empty" in span["class"]:
        continue
    date = span["data-date"].strip()
    if "soldout" in span["class"]:
        data["soldout"].append(date)
    else:
        data["notsoldout"].append(date)

print("Sold Out Dates")
print(data["soldout"])
print("---" * 25)
print("Available Dates")
print(data["notsoldout"])

Output:

Sold Out Dates
['2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-20', '2020-08-22', '2020-08-23', '2020-08-29', '2020-08-30']
---------------------------------------------------------------------------
Available Dates
['2020-08-12', '2020-08-19', '2020-08-21', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28']
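
To try the classification step without hitting the network, here is a minimal sketch that runs on a hand-written HTML fragment. The fragment is an assumption: it only mimics the `grid-item`, `empty`, and `soldout` class names and the `data-date` attribute that the loop above relies on, and the real response may look different. It also swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages. `CalendarParser` and `classify_dates` are hypothetical names.

```python
from html.parser import HTMLParser

# Hand-written fragment (an assumption, not the site's real markup).
SAMPLE_HTML = """
<span class="grid-item empty"></span>
<span class="grid-item soldout" data-date=" 2020-08-08 "></span>
<span class="grid-item" data-date="2020-08-12"></span>
<span class="grid-item soldout" data-date="2020-08-13"></span>
<span class="legend soldout" data-date="2020-08-14"></span>
"""


class CalendarParser(HTMLParser):
    """Collects data-date values of span.grid-item cells, split by status."""

    def __init__(self):
        super().__init__()
        self.data = {"soldout": [], "notsoldout": []}

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        # Ignore spans that are not calendar cells, and cells marked "empty".
        if "grid-item" not in classes or "empty" in classes:
            return
        date = (attr_map.get("data-date") or "").strip()
        if not date:
            return  # unlike the loop above, skip cells without a date
        key = "soldout" if "soldout" in classes else "notsoldout"
        self.data[key].append(date)


def classify_dates(html):
    """Return {'soldout': [...], 'notsoldout': [...]} for an HTML string."""
    parser = CalendarParser()
    parser.feed(html)
    parser.close()
    return parser.data


result = classify_dates(SAMPLE_HTML)
print(result["soldout"])     # ['2020-08-08', '2020-08-13']
print(result["notsoldout"])  # ['2020-08-12']
```

Passing `res.text` from the request above to `classify_dates` should give the same split as the BeautifulSoup loop, provided the response uses the markup assumed here.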

You can parse almost any website... I just loaded this page into BeautifulSoup, and the content built by JavaScript is not available. That means you will need something like Selenium to drive a browser so the page loads before you can access the DOM.
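
For the Selenium route this comment describes, a rough sketch might look like the following. It assumes Selenium 4 plus a Chrome/Chromium browser and matching driver are already installed in the Colab runtime (installation is not shown). `headless_chrome_flags` and `fetch_rendered_html` are hypothetical helper names, and the `span.grid-item` selector in the usage comment is borrowed from the answer's code rather than verified against the live page. The Selenium imports sit inside the function so the module still loads where Selenium is absent.

```python
def headless_chrome_flags():
    """Flags commonly used to run Chrome without a display (e.g. in a notebook VM)."""
    return ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def fetch_rendered_html(url, wait_css, timeout=20):
    """Load `url` in headless Chrome, wait for `wait_css`, return the rendered HTML."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for flag in headless_chrome_flags():
        options.add_argument(flag)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Example usage (not executed here; needs a browser and network access):
# html = fetch_rendered_html(page_url, "span.grid-item")
# soup = BeautifulSoup(html, "html.parser")
```

The returned HTML can then be handed to BeautifulSoup as usual. When the data comes from a single AJAX endpoint, as in the answer above, calling that endpoint directly is usually lighter than driving a browser.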