Web scraping an interactive JavaScript chart
I have seen a few posts about this, but every case is evidently unique. I am trying to get the data behind the chart on this page. It is a fairly obscure market index that is not available through Yahoo, which is where I usually look (specifically web.DataReader in Python), and this is one of the few places that has a complete set of daily prices.
<script nonce="XL1oARYPz8X2tvqk">
window.__defaultsOverrides = {
'mainSeriesProperties.style': 3,
'mainSeriesProperties.areaStyle.priceSource': 'close',
'scalesProperties.lineColor': 'rgba( 76, 82, 94, 1)',
'scalesProperties.showSymbolLabels': false,
'scalesProperties.textColor': 'rgba( 76, 82, 94, 1)',
'scalesProperties.seriesLastValueMode': 0,
'paneProperties.topMargin': 13,
'paneProperties.legendProperties.showStudyArguments': false,
'paneProperties.legendProperties.showStudyTitles': false,
'paneProperties.legendProperties.showStudyValues': false,
'paneProperties.legendProperties.showSeriesTitle': false,
'paneProperties.legendProperties.showSeriesOHLC': true,
'paneProperties.legendProperties.showLegend': false,
};
</script>
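This is what shows up as the element associated with the chart. Frankly, this is a bit beyond me as far as web development goes, because it is just a script tag (that is, it is not a child of the chart element, it is the chart element). I tried searching the JS files for the nonce value XL1oARYPz8X2tvqk but did not see anything that looked like it populates the chart.
I assume I can find the chart data somewhere in the window object, but I do not see it. Is there a simple way to track this down? I know I could use an interactive scraper, but it seems like it should be simpler than that.

The data is retrieved over a websocket connection to the following endpoint: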
wss://data.tradingview.com/socket.io/websocket?from=symbols%2FNASDAQ-VOLI%2F
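You can get the data by sending commands to this websocket and reading the responses. All messages sent and received are visible in the Chrome developer console.
Each response is a stream of JSON objects (a single response can contain several), each preceded by a prefix of the form ~m~<number>~m~, where the number in the middle varies. The response therefore has to be split with a regular expression.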
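To make the framing concrete, here is a minimal sketch of the split. The sample message is made up for illustration (it is not captured traffic), and split_frames is a hypothetical helper name. It assumes a JSON payload never itself contains text that looks like a ~m~<number>~m~ prefix.

```python
import json
import re

# Every payload is preceded by ~m~<number>~m~; the number varies per payload.
FRAME_PREFIX = re.compile(r"~m~\d+~m~")

def split_frames(raw: str) -> list:
    """Split one websocket message into its decoded JSON payloads."""
    payloads = []
    for chunk in FRAME_PREFIX.split(raw):
        # Skip the empty piece before the first prefix and heartbeat chunks.
        if not chunk or chunk.startswith("~h~"):
            continue
        payloads.append(json.loads(chunk))
    return payloads

# Made-up sample: one JSON object followed by a heartbeat.
sample = '~m~23~m~{"m":"example","p":[1]}~m~4~m~~h~1'
print(split_frames(sample))
```

The heartbeat chunk is dropped because it is not JSON and would make json.loads fail.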
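The page sends many messages (shown in green in the console), but we are only interested in the ones that use the "chart session token", that is, the commands that control the chart rather than the quotes.
Send the following messages at the start: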
{"m": "set_data_quality", "p": ["low"]},
{"m": "set_auth_token", "p": ["unauthorized_user_token"]},
{"m":"chart_create_session","p":[chartSession,""]},
{"m":"resolve_symbol","p":[chartSession,"symbol_1","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
{"m":"create_series","p":[chartSession,"s1","s1","symbol_1","D",300]},
{"m":"switch_timezone","p":[chartSession,"Etc/UTC"]},
{"m":"resolve_symbol","p":[chartSession,"symbol_2","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
{"m":"modify_series","p":[chartSession,"s1","s2","symbol_2","D,12M"]},
After that, you will receive a response whose "m" value is timescale_update, which contains the chart data along with other information.
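As a rough sketch of how the chart data can be pulled out of such a response: the payload below is a made-up sample that only mirrors the nesting the script relies on (the series list under p[1]["s1"]["s"], with the timestamp at v[0] and the plotted value at v[4]). The meaning of the other positions in v is not confirmed here, and extract_series is a hypothetical helper name.

```python
from datetime import datetime, timezone

def extract_series(payload: dict, series_id: str = "s1") -> list:
    """Return (UTC datetime, value) pairs from a timescale_update payload."""
    bars = payload["p"][1][series_id]["s"]
    return [
        # v[0] is a Unix timestamp in seconds, v[4] is the value that gets plotted.
        (datetime.fromtimestamp(bar["v"][0], tz=timezone.utc), bar["v"][4])
        for bar in bars
    ]

# Made-up payload with two daily bars, for illustration only.
sample = {
    "m": "timescale_update",
    "p": ["cs_example", {"s1": {"s": [
        {"i": 0, "v": [0.0, 1.0, 2.0, 0.5, 1.5, 0.0]},
        {"i": 1, "v": [86400.0, 1.5, 2.5, 1.0, 2.0, 0.0]},
    ]}}],
}
rows = extract_series(sample)
for when, value in rows:
    print(when.date(), value)
```

Passing tz=timezone.utc keeps the dates independent of the local timezone of the machine running the code.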
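The following script opens the websocket connection, sends the initial messages needed to get the chart data, and builds a chart that is saved as a PNG: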
import asyncio
import json
import re
import urllib.parse
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import websockets

wsParams = {
    "from": "symbols/NASDAQ-VOLI/"
}
websocketUri = f"wss://data.tradingview.com/socket.io/websocket?{urllib.parse.urlencode(wsParams)}"
chartSession = "cs_Dj1BV8ochLL0"  # hard-coded chart session token
initMessages = [
    {"m": "set_data_quality", "p": ["low"]},
    {"m": "set_auth_token", "p": ["unauthorized_user_token"]},
    {"m":"chart_create_session","p":[chartSession,""]},
    {"m":"resolve_symbol","p":[chartSession,"symbol_1","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
    {"m":"create_series","p":[chartSession,"s1","s1","symbol_1","D",300]},
    {"m":"switch_timezone","p":[chartSession,"Etc/UTC"]},
    {"m":"resolve_symbol","p":[chartSession,"symbol_2","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
    {"m":"modify_series","p":[chartSession,"s1","s2","symbol_2","D,12M"]},
]

def strip(text):
    # Split on the ~m~<number>~m~ prefixes and decode each JSON object,
    # skipping heartbeat chunks (~h~<n>), which are not JSON.
    chunks = re.split(r'~m~\d+~m~', text)
    return [json.loads(t) for t in chunks if t and not t.startswith('~h~')]

def unstrip(message):
    # Serialize the message and prefix it with the length of the serialized text.
    payload = json.dumps(message)
    return f"~m~{len(payload)}~m~{payload}"

async def init(websocket):
    for m in initMessages:
        await websocket.send(unstrip(m))

async def startReceiving(websocket):
    # Print the first message the server sends, then send the setup messages.
    data = await websocket.recv()
    print(strip(data))
    await init(websocket)
    while True:
        data = await websocket.recv()
        payloads = strip(data)
        for p in payloads:
            if p["m"] == "timescale_update":
                bars = p["p"][1]["s1"]["s"]
                dates = [datetime.fromtimestamp(t["v"][0]) for t in bars]
                values = [t["v"][4] for t in bars]
                plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y'))
                plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=25))
                plt.plot(dates, values)
                plt.gcf().autofmt_xdate()
                plt.ylabel('VOLI Index Chart')
                plt.xlabel('Date')
                plt.savefig("voli.png")
        print(payloads)

async def websocketConnect():
    # The Origin header is required; the origin argument sets it.
    async with websockets.connect(websocketUri, origin="https://www.tradingview.com") as websocket:
        print('started websocket')
        await startReceiving(websocket)

asyncio.run(websocketConnect())
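The script saves the resulting chart to voli.png.
Note:
- To connect to the websocket server successfully, you need to send the Origin header with the correct value; otherwise the server returns 403.
- The chart session token is hard-coded here, but it can be anything. On the website it appears to be generated randomly (following a regular expression pattern).
- I have removed all websocket messages related to quotes. You need to add such messages to the init messages to be notified of "real-time" value changes. quote_create_session is required, with a new session token (different from the chart session token). You will then receive notifications over the websocket.
- If you want to receive notifications, be aware that there is a keep-alive: the websocket is closed automatically if you send nothing for some period of time. You only need to send the following message periodically:
~m~4~m~~h~1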
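The notes above on the session token and the keep-alive can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, the token shape (a prefix plus 12 alphanumeric characters) is inferred from the hard-coded cs_Dj1BV8ochLL0, and the 10-second interval is a placeholder because the actual timeout is not known.

```python
import asyncio
import json
import random
import string

def frame(payload: str) -> str:
    """Prefix a payload with ~m~<length>~m~, as in the keep-alive message."""
    return f"~m~{len(payload)}~m~{payload}"

def frame_json(message: dict) -> str:
    """Serialize a command compactly and frame it."""
    return frame(json.dumps(message, separators=(",", ":")))

def make_session_id(prefix: str = "cs_", length: int = 12) -> str:
    """Random token shaped like the hard-coded one (assumed pattern)."""
    alphabet = string.ascii_letters + string.digits
    return prefix + "".join(random.choice(alphabet) for _ in range(length))

# "~h~1" is 4 characters long, which yields exactly ~m~4~m~~h~1.
HEARTBEAT = frame("~h~1")

async def keep_alive(websocket, interval_seconds: float = 10.0):
    """Send the keep-alive periodically; the interval is a placeholder."""
    while True:
        await asyncio.sleep(interval_seconds)
        await websocket.send(HEARTBEAT)

print(HEARTBEAT)
print(make_session_id())
```

keep_alive would typically be started with asyncio.create_task(keep_alive(websocket)) next to the receive loop and cancelled when the connection closes.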