Python: using bs4 to parse a site

python parsing beautifulsoup

I want to use bs4 to parse the price information from Bitmex (the site URL is the one used in the code below), so I wrote this code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)

if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )
price = soup.find_all("span", {"class": "price"})
print(price)

The result looks like this:

connected...
[]

Why does it print "[]"? What should I do to get the price text, such as "6065.5"? The text I want to parse is

<span class="price">6065.5</span>

The selector is

content>div>div.tickerBar.overflown>div>span.instruments.tickerBarSection>span:nth-child(1)>span.price

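I am only just learning Python, so this question may seem strange to professionals... sorry.

You are close. Try the following and see whether it is closer to what you want. The format you see in the browser may not be exactly the format you retrieve. Hope this helps.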

from bs4 import BeautifulSoup
import requests
import sys
import json

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)

if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
    sys.exit(1)

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )

# extract the json text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()

# parse json text
d = json.loads(price.text)

# pull out the order book and then each price listed in the order book
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)

Sample output:

connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
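For illustration, here is a slightly more defensive variant of the extraction step, run against a small inline HTML string instead of the live site. The sample markup and the helper name extract_initial_data are assumptions made for this sketch, and it uses the built-in html.parser so that lxml is not required. The real page may be structured differently or may have changed since the answer was written.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by requests.get(url).text
SAMPLE_HTML = """
<html><body>
<div id="content"></div>
<script id="initialData" type="application/json">
{"orderBook": [
  {"symbol": "XBTUSD", "side": "Sell", "price": 6045.5},
  {"symbol": "XBTUSD", "side": "Buy", "price": 6045}
]}
</script>
</body></html>
"""


def extract_initial_data(html):
    """Return the JSON embedded in <script id="initialData">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first match or None, so no list handling is needed
    tag = soup.find("script", {"id": "initialData"})
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


data = extract_initial_data(SAMPLE_HTML)
prices = [row["price"] for row in data["orderBook"]]
print(prices)
```

Returning None when the tag is missing lets the caller decide how to react, for example if the site stops embedding the data.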


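Your problem is that the page does not contain those span elements in the first place. If you check the response tab in the browser developer tools (press F12 in Firefox), you can see that the page is made up of script tags, with some code written in JavaScript that creates the elements dynamically when it is executed.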
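A small offline sketch can make this concrete. The HTML below is a hypothetical, heavily simplified stand-in for what a server might send: an empty container plus a script that would create the span in a browser. It is not the actual Bitmex markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified response body: the span is created by the script,
# so it is not part of the markup that the parser sees.
RAW_HTML = """
<html><body>
<div id="content"></div>
<script>
  var span = document.createElement("span");
  span.className = "price";
  span.textContent = "6065.5";
  document.getElementById("content").appendChild(span);
</script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The span would only exist after a browser ran the script, so nothing matches
spans = soup.find_all("span", {"class": "price"})
print(spans)

# The script tag itself is ordinary markup and is found without trouble
scripts = soup.find_all("script")
print(len(scripts))
```

The parser treats the script body as plain text, which is why the search for span elements comes back empty even though the word "span" appears in the response.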

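Since BeautifulSoup cannot execute JavaScript, you cannot use it to extract those elements directly. You have two options:

Use something like selenium, which lets you drive a browser from Python. The JavaScript is executed because you are using a real browser, but performance suffers.

Read the JavaScript code, understand it, and write Python code that emulates it. This is usually harder, but fortunately it looks quite simple for the page you want: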

import json

import requests
import lxml.html

# Fetch the page and parse the raw HTML
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)

# The order book is embedded as JSON inside <script id="initialData">
data = json.loads(doc.xpath("//script[@id='initialData']/text()")[0])

As you can see, the data is embedded in the page as JSON. Once it is loaded into the data variable, you can use it to access the information you need:

for row in data['orderBook']:
    print(row['symbol'], row['price'], row['side'])

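This prints the following (the tuple form is Python 2 output; Python 3 prints the same values separated by spaces, without parentheses or quotes):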

('XBTUSD', 6051.5, 'Sell')
('XBTUSD', 6051, 'Sell')
('XBTUSD', 6050.5, 'Sell')
('XBTUSD', 6050, 'Sell')
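For comparison, the browser-driving option mentioned in this answer might be sketched roughly as follows. This assumes selenium 4 with Chrome and a matching driver available, and it assumes the rendered page still contains span.price elements, which may no longer be true. The function name fetch_rendered_prices is made up for this sketch.

```python
def fetch_rendered_prices(url, timeout=10):
    """Load url in a real browser and return the text of each span.price."""
    # Imported lazily so this file can be loaded without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until the page's JavaScript has created at least one match
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span.price"))
        )
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()


# Example call (opens a browser window, so it is left commented out):
# print(fetch_rendered_prices("https://www.bitmex.com/app/trade/XBTUSD"))
```

Starting a browser for every request is the performance cost the answer refers to, which is why reading the embedded JSON is preferable when it is available.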


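@KunC selenium has its advantages and disadvantages. If the JavaScript is simple enough, it is better to write Python code that emulates it, because that will probably perform better than starting a full web browser and executing the JavaScript. For the bitmex page in your question I would not use selenium, because the data is already easily available in javascript/json format and Python can read JSON easily.

Thank you so much for your help. I envy how easily you get information from the Bitmex site... haha :)

Hmm... after reading your answer, I don't understand what happens after price = soup.find_all("script", {"id": "initialData"}). Could you explain it in detail? Actually, I only want the current price... sorry to bother you.

No trouble at all. When you view the page in a browser you may see only one price, but when we parse it we find that the data is actually held in a JavaScript structure. We can load it with json, but then we find that there is more than one price. You need to check which one you want. Having all the prices may be more powerful. Print the whole order book to see the full contents. Good luck. Bitmex seems like an interesting service.

I use DevTools in Chrome, but I can't find "script" {"id": "initialData"} in the Elements tab. Where can I find it? find_all("script", {"id": "initialData"}) means "find everything whose tag name is script and whose id is initialData". Am I right? Also, I don't know what ".pop" is...

Try print(soup) to see what the Python script received; there you will see the reference to initialData. find_all() finds every occurrence you describe and returns them as a list, and pop() removes and returns the last element of that list (with a single match, that is the only element). Using .find(...) instead would probably be enough.

Aha! Now I understand why I was confused. When I open DevTools in Chrome (F12), I can't find "id: initialData" in the Elements tab, but I can find it when I parse the page with bs4 in Python. The same content is also in the Sources tab.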
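Since the question was about a single current price, one possible reading is the best bid and best ask in the order book. The sketch below uses hypothetical rows shaped like the ones printed in the answers; the "Buy" side value is an assumption, since only "Sell" rows appear in the sample output, and the price shown on the site may be defined differently (for example, as the last traded price).

```python
# Hypothetical rows shaped like the orderBook entries printed in the answers
order_book = [
    {"symbol": "XBTUSD", "side": "Sell", "price": 6051.5},
    {"symbol": "XBTUSD", "side": "Sell", "price": 6051},
    {"symbol": "XBTUSD", "side": "Buy", "price": 6050.5},
    {"symbol": "XBTUSD", "side": "Buy", "price": 6050},
]


def best_prices(rows):
    """Return (best_bid, best_ask); either is None if that side is empty."""
    asks = [row["price"] for row in rows if row["side"] == "Sell"]
    bids = [row["price"] for row in rows if row["side"] == "Buy"]
    best_ask = min(asks) if asks else None  # lowest price a seller accepts
    best_bid = max(bids) if bids else None  # highest price a buyer offers
    return best_bid, best_ask


best_bid, best_ask = best_prices(order_book)
print(best_bid, best_ask)

# The midpoint is one common single-number summary of the two
if best_bid is not None and best_ask is not None:
    mid_price = (best_bid + best_ask) / 2
    print(mid_price)
```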