
Python: How do I extract text from a <span> nested in an <li> nested in a <ul> using BeautifulSoup?

Tags: python, html, web-scraping, beautifulsoup

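I want to extract the content of the "Here’s what’s new" section, starting with "In the coming weeks" and ending with "general enhancements".

Inspecting the markup, I see that the <span> is nested under an <li>, which in turn is nested in a <ul>. For the past few days I have been trying to extract it with Python 3 and BeautifulSoup, without success. I am pasting the code I tried below.

Can someone point me in the right direction?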

#1

#2

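Ideally, the code should return:

      In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.

      Performance improvements, bug fixes, and other general enhancements.

But neither attempt returns anything. It looks like the ul with that ID cannot be found, yet if you print(soup) everything looks fine: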

      <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
      <li>
      <span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.</span></li>
      
      <li>
      <span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>
      
      
      </ul>
      
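      • Read Now: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.
      • Performance improvements, bug fixes, and other general enhancements.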
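
The two attempts from the question are not shown above, so the following is only a minimal sketch of how the lookup by ID could work, assuming the parser receives exactly the markup printed by print(soup). The names HTML, UL_ID, and list_item_texts are made up for illustration. If the same lookup returns nothing against the live page, the downloaded HTML probably differs from this static copy.

```python
from bs4 import BeautifulSoup

# Static copy of the markup printed above (assumption: this is what the parser sees).
HTML = """
<ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
<li>
<span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.</span></li>
<li>
<span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>
</ul>
"""

UL_ID = "GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B"


def list_item_texts(html, ul_id):
    """Return the stripped text of every <li> inside the <ul> with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", id=ul_id)
    if ul is None:  # the id is not present in this HTML
        return []
    # get_text() flattens the nested <span>/<strong> tags into plain text.
    return [li.get_text(strip=True) for li in ul.find_all("li")]


for text in list_item_texts(HTML, UL_ID):
    print(text)
```

Returning an empty list when the <ul> is missing makes it easy to tell "element not in the HTML" apart from "element found but empty".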

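First, the page is rendered dynamically, so you have to use selenium to get the page content correctly.

Second, find the p tag that contains the text "Here’s what’s new", then take the next ul tag after it.

The code: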

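      from bs4 import BeautifulSoup
      from selenium import webdriver

      url = "https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS"

      # The page is rendered dynamically, so load it in a real browser first.
      driver = webdriver.Firefox()
      try:
          driver.get(url)  # get() returns None; the HTML is in driver.page_source
          html = BeautifulSoup(driver.page_source, 'html.parser')
      finally:
          driver.quit()  # always close the browser

      # Find the paragraph that introduces the section, then its next <ul> sibling.
      for p in html.find_all('p'):
          if "Here’s what’s new" in p.text:
              ul = p.find_next_sibling('ul')
              if ul is None:  # no list follows this paragraph
                  continue
              for li in ul.find_all('li'):
                  print(li.text)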
      
Output:

      Read Now: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.
      
      Performance improvements, bug fixes, and other general enhancements.
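
The find-the-paragraph-then-next-list step can be tried without a browser. This is a rough sketch on a hypothetical stand-in for the rendered page: SAMPLE and items_after_heading are invented for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the rendered page; the real markup may differ.
SAMPLE = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <ul>
    <li>Read Now: first item.</li>
    <li>Performance improvements.</li>
  </ul>
  <p>Unrelated paragraph.</p>
  <ul><li>Unrelated item.</li></ul>
</div>
"""


def items_after_heading(html, heading):
    """Return the <li> texts of the first <ul> sibling after the matching <p>."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        if heading in p.get_text():
            # find_next_sibling skips whitespace text nodes between the tags.
            ul = p.find_next_sibling("ul")
            if ul is not None:
                return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []


print(items_after_heading(SAMPLE, "Here’s what’s new"))
```

Only the list that follows the matching paragraph is returned; the later, unrelated list is ignored.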
      

With bs4 4.7.1+, you can use :contains and :has to isolate the section:

      import requests
      from bs4 import BeautifulSoup as bs
      
      r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
      soup = bs(r.content, 'lxml')
      text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
      print(text)
      

Currently, you can also drop :contains:

      text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
      print(text)
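
To experiment with these selectors offline, here is a small sketch on made-up markup. SAMPLE and select_items are hypothetical, and the p, p, ul layout is only assumed from the selector above. It uses :-soup-contains, the name that Soup Sieve 2.1 and later prefers over the deprecated :contains, so it assumes a reasonably recent install.

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the assumed layout: heading <p>, another <p>, then the <ul>.
SAMPLE = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Intro paragraph.</p>
  <ul>
    <li>First item.</li>
    <li>Second item.</li>
  </ul>
  <ul><li>Unrelated item.</li></ul>
</div>
"""


def select_items(html):
    """Select the <li> elements of the list two siblings after the heading <p>."""
    soup = BeautifulSoup(html, "html.parser")
    # :has() filters on descendants; :-soup-contains() filters on text content.
    selector = 'p:has(strong:-soup-contains("what’s new")) + p + ul li'
    return [li.get_text(strip=True) for li in soup.select(selector)]


print(select_items(SAMPLE))
```

Unlike the selector in the answer, this one leaves out the first alternative of the selector list, so the heading paragraph itself is not included in the result.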
      
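+ is the CSS adjacent sibling combinator. Quote:

      Adjacent sibling combinator

      The + combinator selects adjacent siblings. This means that the second element directly follows the first, and both share the same parent.

      Syntax: A + B

      Example: h2 + p will match all <p> elements that directly follow an <h2>.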
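
The h2 + p example from the quote can be checked with BeautifulSoup's select(). A minimal sketch on invented markup (SAMPLE and adjacent_paragraphs are illustrative names only):

```python
from bs4 import BeautifulSoup

# Invented markup: only the first <p> directly follows an <h2>.
SAMPLE = """
<div>
  <h2>Title</h2>
  <p>Directly after the heading.</p>
  <p>Second paragraph.</p>
  <h2>Another title</h2>
  <ul><li>Not a paragraph.</li></ul>
  <p>After the list.</p>
</div>
"""


def adjacent_paragraphs(html):
    """Return the text of every <p> whose previous element sibling is an <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select("h2 + p")]


print(adjacent_paragraphs(SAMPLE))
```

The second paragraph and the paragraph after the list are skipped because the element immediately before each of them is not an <h2>.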


I believe your solution is fine, but because of the many problems with selenium and its drivers, I could not get it to run on my machine.