Python: How to scrape hidden data elements with BeautifulSoup
Level2StockQuotes.com offers free real-time top-of-book quotes, and I would like to capture them in Python using BeautifulSoup. The problem is that even though I can see the actual data values in the browser inspector, I cannot scrape them into Python. BeautifulSoup returns all the data rows with every data element empty. Pandas returns a DataFrame with NaN for every data element.
import bs4 as bs
import urllib.request
import pandas as pd

symbol = 'AAPL'
url = 'https://markets.cboe.com/us/equities/market_statistics/book/' + symbol + '/'
page = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(page, 'lxml')
rows = soup.find_all('tr')
print(rows)
for tr in rows:
    td = tr.find_all('td')
    # build a list so that print shows the cell text, not a generator object
    row = [i.text for i in td]
    print(row)

# using pandas to get dataframe
dfs = pd.read_html(url)
for df in dfs:
    print(df)
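Can anyone more experienced than I am tell me how to extract this data?
Thanks

The page is dynamic. You need either to use Selenium to simulate a browser and render the page before grabbing the HTML, or to get the data directly from the JSON XHR: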
import requests
import pandas as pd

url = 'https://markets.cboe.com/json/bzx/book/AAPL'
headers = {
    'Referer': 'https://markets.cboe.com/us/equities/market_statistics/book/AAPL/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'}
jsonData = requests.get(url, headers=headers).json()

df_asks = pd.DataFrame(jsonData['data']['asks'], columns=['Shares', 'Price'])
df_bids = pd.DataFrame(jsonData['data']['bids'], columns=['Shares', 'Price'])
# each trade row appears to be ordered time, shares, price, time with fraction
# (the printed values show 300 shares at 209.07), so label the columns in that order
df_trades = pd.DataFrame(jsonData['data']['trades'], columns=['Time', 'Shares', 'Price', 'Time_ms'])

df_list = [df_asks, df_bids, df_trades]
for df in df_list:
    print(df)

Output:

   Shares   Price
0      40  209.12
1     100  209.13
2     200  209.14
3     100  209.15
4      24  209.16
   Shares   Price
0     200  209.05
1     200  209.02
2     100  209.01
3     200  209.00
4     100  208.99
       Time  Shares     Price          Time_ms
0  10:45:57     300  209.0700  10:45:57.936000
1  10:45:57     300  209.0700  10:45:57.936000
2  10:45:55      29  209.1100  10:45:55.558000
3  10:45:52      45  209.0900  10:45:52.265000
4  10:45:52      50  209.0900  10:45:52.265000
5  10:45:52       5  209.0900  10:45:52.265000
6  10:45:51     100  209.1100  10:45:51.902000
7  10:45:48     100  209.1400  10:45:48.528000
8  10:45:48     100  209.1300  10:45:48.528000
9  10:45:48     200  209.1300  10:45:48.528000
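
The labelling step in the answer can also be separated from the network call, which makes it possible to check the parsing offline. The following is only a minimal sketch under stated assumptions: `label_book`, `fetch_book` and `BOOK_COLUMNS` are hypothetical names, the trade field order (time, shares, price, time with fraction) is inferred from the printed values rather than from any documented API, substituting another symbol into the URL is assumed to work the same way as for AAPL, and the sample payload is hand-written to resemble the output shown above.

```python
import json
import urllib.request

# Column labels per section. The order for "trades" is an assumption inferred
# from the sample output (300 shares at 209.07), not from documented behaviour.
BOOK_COLUMNS = {
    "asks": ["Shares", "Price"],
    "bids": ["Shares", "Price"],
    "trades": ["Time", "Shares", "Price", "Time_ms"],
}


def label_book(payload):
    """Turn the nested lists under payload['data'] into lists of dicts."""
    data = payload.get("data")
    if not isinstance(data, dict):
        raise ValueError("payload has no 'data' object; the endpoint may have changed")
    labelled = {}
    for section, columns in BOOK_COLUMNS.items():
        rows = data.get(section, [])  # a missing section becomes an empty list
        labelled[section] = [dict(zip(columns, row)) for row in rows]
    return labelled


def fetch_book(symbol, timeout=10):
    """Hypothetical helper: request the JSON endpoint used in the answer."""
    url = "https://markets.cboe.com/json/bzx/book/" + symbol
    request = urllib.request.Request(url, headers={
        "Referer": "https://markets.cboe.com/us/equities/market_statistics/book/" + symbol + "/",
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    })
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hand-written sample shaped like the output above (an assumption about the
# real payload), so the labelling can be exercised without network access.
sample = {"data": {
    "asks": [[40, 209.12], [100, 209.13]],
    "bids": [[200, 209.05]],
    "trades": [["10:45:57", 300, 209.07, "10:45:57.936000"]],
}}

book = label_book(sample)
print(book["asks"][0])
print(book["trades"][0])
```

Each labelled section is a list of dicts, so `pd.DataFrame(book["asks"])` should give the same kind of frame as in the answer. For live data, `label_book(fetch_book("AAPL"))` would replace the sample, provided the endpoint still responds as it did when the answer was written.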
The page uses an AJAX request to load its data. Get the content from that request in JSON format.
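
A quick way to confirm that a table is filled in by JavaScript is to count how many cells in the served HTML actually contain text. The sketch below is only an illustration: it uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages, `STATIC_HTML` is a made-up stand-in for the markup the server sends (the real page's structure and class names are not known here), and `CellCollector` and `count_filled_cells` are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up stand-in for the HTML received before any JavaScript has run.
STATIC_HTML = """
<table>
  <tr><td class="shares"></td><td class="price"></td></tr>
  <tr><td class="shares"></td><td class="price"></td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, initially empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data.strip()


def count_filled_cells(html):
    """Return (cells with text, total cells) for the given HTML string."""
    parser = CellCollector()
    parser.feed(html)
    filled = [cell for cell in parser.cells if cell]
    return len(filled), len(parser.cells)


filled, total = count_filled_cells(STATIC_HTML)
print(f"{filled} of {total} cells contain text")  # prints "0 of 4 cells contain text"
```

If the rows exist but no cell contains text, the values are most likely injected after page load, and the browser's network tab is the place to look for the XHR request that carries them.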