Strange bug in Beautiful Soup 4 when iterating with a Python for loop


I'm trying to scrape a site that loads its data with AJAX. I want to do this for a bunch of URLs that I've put in a list, iterating over them with a for loop. Here is my code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import pdb

listUrls = ['https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya','https://www.flipkart.com/samsung-galaxy-on8-gold-16-gb/p/itmemvarkqg5dyay']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)

for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.findAll('li', {'class':"_1KuY3T row"})
    print(labels)
When I run this code I get results for the first URL, but the second one comes out as an empty list. I tried printing the soup for both URLs and that worked in both cases; the problem only shows up when printing the labels. The labels print for the first URL, but the second list is empty:

[<truncated>...Formats</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">MP3</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Width</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">75 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Height</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">151.7 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
[]

Stepping through the loop with pdb, however, the labels print correctly:

> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(12)<module>()
-> for url in listUrls:
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(13)<module>()
-> browser.get(url)
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(15)<module>()
-> soup = BeautifulSoup(browser.page_source, "html.parser")  # put all the html into soup
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(16)<module>()
-> labels = soup.findAll('li', {'class': "_1KuY3T row"})
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(17)<module>()
-> pdb.set_trace()
(Pdb)
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(18)<module>()
-> print(labels)
(Pdb) n
The fact that it works while you are debugging suggests that this is a timing issue. When you step through the code in the debugger, you are effectively giving the page more time to load, which is why the labels print correctly.

What you need to do to make things more reliable and predictable is to add an explicit wait for at least one label to be present on the page:

          from selenium.webdriver.common.by import By
          from selenium.webdriver.support.ui import WebDriverWait
          from selenium.webdriver.support import expected_conditions as EC
          
          # ...
          
          for url in listUrls:
              browser.get(url)
          
              # wait for labels to be present/rendered
              wait = WebDriverWait(browser, 20)
              wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li._1KuY3T.row")))
          
              soup = BeautifulSoup(browser.page_source, "html.parser")
              labels = soup.select("li._1KuY3T.row")
              print(labels)
          
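For intuition, an explicit wait is just a polling loop: Selenium re-evaluates the condition periodically (every 500 ms by default) until it returns something truthy or the timeout elapses. Here is a minimal sketch of that idea in plain Python; `wait_until` and `WaitTimeout` are hypothetical names, not Selenium's actual implementation:

```python
import time

class WaitTimeout(Exception):
    """Raised when the condition is not met in time (stand-in for selenium's TimeoutException)."""

def wait_until(condition, timeout=20.0, poll=0.5):
    """Re-evaluate `condition` every `poll` seconds until it returns a truthy
    value (which is then returned) or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

`WebDriverWait(browser, 20).until(...)` has the same shape: `until` returns the located element, so you can assign it directly instead of re-querying the page.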


It would help if you could add the result/stack trace to the question as text rather than as an image. – TeemuRisikko
@TeemuRisikko Done.
Thanks @alecxe! This works. What isn't clear to me, though: if the loop worked for the first URL, why did I need to add an explicit wait to make it work for the second one?
@dontpanic well, if you run the code, say, a hundred times, I bet you'll see it fail on the first URL too. The point is that the wait makes the code reliable: you don't have to assume an element is rendered at some particular moment, you just explicitly wait for it. Please consider accepting the answer to resolve the topic, thanks.
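The race described in the comments can be reproduced without a browser. In this toy simulation (all names are hypothetical, no Selenium involved), the "page" populates its content only after a short delay, so reading it immediately usually sees nothing, while polling until a deadline is reliable:

```python
import threading
import time

class FakeAjaxPage:
    """Simulates an AJAX-rendered page: the row appears only after `delay` seconds."""
    def __init__(self, delay):
        self.rows = []
        timer = threading.Timer(
            delay, lambda: self.rows.append("<li class='_1KuY3T row'>spec-row</li>"))
        timer.daemon = True
        timer.start()

def scrape(page, wait=False, timeout=2.0):
    """Read page.rows, optionally polling until they have been rendered."""
    if wait:
        deadline = time.monotonic() + timeout
        while not page.rows and time.monotonic() < deadline:
            time.sleep(0.02)
    return list(page.rows)

# Reading immediately races the "AJAX" timer and usually sees an empty list;
# polling first is reliable.
print(scrape(FakeAjaxPage(delay=0.2)))             # usually []
print(scrape(FakeAjaxPage(delay=0.2), wait=True))  # the rendered row
```

This is the same asymmetry the question hit: whether the no-wait read "works" depends entirely on how fast the page happens to render, which is why the first URL appeared fine and the second did not.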