Python: How do I parse data inside tags with BeautifulSoup?


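When I try to scrape data from the following site:

url=

I got this from the bedbathbeyond website. If I use requests and BeautifulSoup, I get nothing back. Why?

Code:

The return value is empty: []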

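You can use Selenium WebDriver to get the HTML you are interested in. For example:

import time

from selenium import webdriver


def get_html(url):
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(url)
        # Give the page's JavaScript time to load the reviews.
        time.sleep(5)
        return driver.page_source.strip()
    finally:
        # Close the browser even if loading fails.
        driver.quit()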


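I used js2py because the materials object contains several keys (BVRRRatingSummarySourceID, BVRRSecondaryRatingSummarySourceID and BVRRSourceID), and if you need them, pulling the HTML out of their values with a regular expression alone would be much harder.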

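import re

from bs4 import BeautifulSoup
import js2py
import requests

# The reviews are served as a JavaScript file, not as part of the page HTML.
r = requests.get('https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml')

# Match the statement `var materials = {"BVRRRatingSummarySourceID" ... }`.
pattern = (r'var'
           r'\s+'
           r'materials'
           r'\s*=\s*'
           r'{"BVRRRatingSummarySourceID".*}')

js_materials = re.search(pattern, r.text).group()
# Evaluate the JavaScript and convert the resulting object to a Python dict.
obj = js2py.eval_js(js_materials).to_dict()
html = obj['BVRRSourceID']
soup = BeautifulSoup(html, 'lxml')
spans = soup.select('span.BVRRReviewAbbreviatedText')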

In the example above I only used the HTML under the BVRRSourceID key, but you can use the whole HTML by joining the values together:

html = ''.join(obj.values())

Don't forget to install js2py with pip install js2py, and run pip install lxml if you want to use the lxml parser.
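If installing js2py is not an option, a standard-library-only variant of the same idea might look like the sketch below. This is not the method used in the answer: it swaps js2py for json, and it only works under the assumption that the materials object literal happens to be valid JSON (double-quoted keys and strings, no trailing commas). SAMPLE_JS and extract_materials are hypothetical names, and the sample text is a made-up stand-in for the much larger response.

```python
import json
import re

# Hypothetical stand-in for the text of reviews.djs (made-up content).
SAMPLE_JS = (
    'var materials={"BVRRRatingSummarySourceID":"<div>4.5</div>",'
    '"BVRRSourceID":"<span class=\\"BVRRReviewAbbreviatedText\\">Great towels</span>"};\n'
    'var other = 1;'
)


def extract_materials(js_text):
    """Return the `materials` object as a dict, or None if it is not found."""
    match = re.search(r'var\s+materials\s*=\s*', js_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no closing-brace regex is needed.
    obj, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return obj


materials = extract_materials(SAMPLE_JS)
print(materials['BVRRSourceID'])
```

If the real object literal is not strict JSON, json raises JSONDecodeError, and evaluating it with js2py as in the answer is the more forgiving route.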


This is because the HTML is loaded by an AJAX call, so BeautifulSoup has nothing to parse in the initial page.

Hi, thanks for your answer. After saving the result to a variable, say a = get_html(url), I tried to parse it with BeautifulSoup: soup = BeautifulSoup(a, 'lxml'), then soup.find_all('span', class_='BVRRReviewText'), but I still can't retrieve anything. Why?

Even though I don't quite understand some parts of the answer, it worked! Thanks!

You can read up on regular expressions.
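To illustrate the AJAX point from the first comment, here is a small sketch of why the search comes back as []. It uses the standard library's HTMLParser as a stand-in for BeautifulSoup so that it runs without extra packages. Both HTML snippets, the id="reviews" container and the find_span_texts helper are hypothetical and only mimic the situation: the static page holds an empty container plus a script reference, and the review spans exist only after the script has run.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of <span> elements that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False   # True while inside a matching <span>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and self.wanted_class in classes:
            self.inside = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'span':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


def find_span_texts(html, css_class):
    parser = SpanTextCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.texts


# What requests would see: an empty container and a script reference (made up).
STATIC_HTML = '<div id="reviews"></div><script src="reviews.djs"></script>'
# What the browser shows after the script has injected the reviews (made up).
RENDERED_HTML = ('<div id="reviews">'
                 '<span class="BVRRReviewAbbreviatedText">Great towels</span>'
                 '</div>')

print(find_span_texts(STATIC_HTML, 'BVRRReviewAbbreviatedText'))    # []
print(find_span_texts(RENDERED_HTML, 'BVRRReviewAbbreviatedText'))  # ['Great towels']
```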