Python Beautiful Soup "features not found" problem

python python-3.x

I'm new to BeautifulSoup, and I'm a bit confused about why this happens:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN')
soup = BeautifulSoup(r.content, 'html.parser')
price = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})
print(price)

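The output is an empty list:

[]

Is there an error in my code, or is BeautifulSoup not getting the site's HTML? Also, whenever I try to use something like 'xml' or 'lxml' instead of 'html.parser', I get an error like this:

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?

As @S.D. and @xeon zolt suggested, the problem seems to be that the content you are searching for is generated by a script. For Beautiful Soup to parse it, we have to load the web page in a browser and then pass the page source to Beautiful Soup.

Based on your comment, I assume you already have Selenium set up. You can load the page in Selenium and then pass the page source to Beautiful Soup as follows: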

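from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()

driver.get("https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN")

# Creating the wait object does not pause anything by itself;
# call wait.until(...) with a condition to wait for rendered content.
wait = WebDriverWait(driver, 5)

page_source = driver.page_source

# quit() ends the whole browser session; close() only closes the current window.
driver.quit()

soup = BeautifulSoup(page_source, 'html.parser')

You can also use headless mode, which means that no visible UI elements (for example, the browser opening and then closing) appear while the script runs. To use headless mode, modify the code to include the following:

from selenium.webdriver.firefox.options import Options

options = Options()
# Newer Selenium 4 releases deprecated and then dropped options.headless;
# passing the -headless argument to Firefox has the same effect.
options.add_argument("-headless")

driver = webdriver.Firefox(options=options)

To answer your last question: you have to install a new parser before you can use it. For example, if you want to use the lxml parser, first run this on the command line:

$ pip install lxml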

Hope this helps.
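As a follow-up to the parser advice in the answer above, here is a minimal sketch of choosing a parser name only if its backing package is installed, which avoids the bs4.FeatureNotFound error. The helper name pick_parser and the preference order are assumptions made for illustration; they are not part of Beautiful Soup.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser whose package is importable, else 'html.parser'.

    pick_parser is a hypothetical helper, not a bs4 function. The names in
    `preferred` double as the package names that provide those parsers.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always available.
    return "html.parser"


parser = pick_parser()
print(parser)
```

The result can then be passed as the second argument, for example BeautifulSoup(page_source, pick_parser()). Note that the 'xml' feature also depends on the lxml package being installed.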

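The data is stored inside the page in a JavaScript variable. You can use the re and json modules to extract the information.

For example:

import re
import json
import requests

url = 'https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN'

html_data = requests.get(url).text

# The page assigns a large JSON object to root.App.main; capture and parse it.
data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']

print('{} {}'.format(price, currency_symbol))

Prints:

2,436.88 $

You can switch to the lxml parser by first running the following in a terminal or command prompt:

pip install lxml

Then try this:

soup = BeautifulSoup(html, "lxml")

For more details:

I don't know Beautiful Soup very well, but I can't find that class, although I can see it in the page source in the browser. I can't find the same div using data-reactid="29" either.

I can find the code in the web source, but for some reason I can't find it in the actual code. The source in BeautifulSoup seems to differ from the source in the browser.

Is there a reason you are searching for the class "My(6px) Pos(r) smartphone_Mt(6px)"? It seems to be the parent div of other divs that contain the information you want to scrape.

It looks like what you are looking for is generated by JS. That won't work with requests. You should take a look at Selenium, which works with an actual browser.

@DesmondCheong I was searching for it because I was trying to make the search more precise, but that didn't work. So I tried looking for the outer div at that level.

I see! OK, so basically I can use all of Selenium in headless mode, and I just need to write the code you posted to get it working in headless mode? Also, would it be better not to pass the page source from Selenium to BeautifulSoup and instead keep using Selenium itself for web scraping?

Yes, you only need to pass the headless option to the Selenium driver. If you are comfortable using Selenium to parse HTML, you can certainly keep using Selenium itself for web scraping. In a sense, not needing an extra library may be better. That said, I personally find the Beautiful Soup interface very pleasant to use.

I also saw @Andrej Kesely's answer. That is another very good avenue to explore, since it gets the content directly from the JavaScript contained in the page. That removes the need for Selenium, which can be a bit heavy.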
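To try the re/json approach from the answer above without depending on the live page (whose markup may have changed since the answer was written), here is a minimal offline sketch. The sample HTML, the payload values, and the helper name extract_app_data are assumptions made for illustration; only the root.App.main pattern and the key path come from the answer.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the data embedded in the page.
payload = {
    "context": {"dispatcher": {"stores": {"QuoteSummaryStore": {"price": {
        "regularMarketPrice": {"raw": 2436.88, "fmt": "2,436.88"},
        "currencySymbol": "$",
    }}}}}
}

# The visible markup has no price element; the data sits in the script tag.
SAMPLE_HTML = (
    "<html><body><div id=\"app\"></div><script>\n"
    "root.App.main = " + json.dumps(payload) + ";\n"
    "</script></body></html>"
)


def extract_app_data(html):
    """Return the object assigned to root.App.main, or None if it is absent."""
    match = re.search(r"root\.App\.main = (\{.*?\});\n", html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_app_data(SAMPLE_HTML)
price_info = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["price"]
print("{} {}".format(price_info["regularMarketPrice"]["fmt"], price_info["currencySymbol"]))
```

Returning None when the pattern is missing makes it easier to notice that a page no longer embeds its data this way, instead of failing with an AttributeError on .group(1).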