Strange error in Beautiful Soup 4 when iterating with a Python for loop
I am trying to scrape a site that loads its data with AJAX. I want to do this for a bunch of URLs that I have put in a list, iterating over them with a for loop. Here is my code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import pdb
listUrls = ['https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya','https://www.flipkart.com/samsung-galaxy-on8-gold-16-gb/p/itmemvarkqg5dyay']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.findAll('li', {'class': "_1KuY3T row"})
    print labels
When I run this code, I get results for the first URL, but the second one comes out as an empty list. I tried printing the soup for both URLs and that worked for both; the error only shows up when printing the labels. The labels for the first URL print, but the list for the second is empty:
[<truncated>...Formats</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">MP3</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Width</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">75 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Height</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">151.7 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
[]
I also stepped through the loop with pdb:

> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(12)<module>()
-> for url in listUrls:
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(13)<module>()
-> browser.get(url)
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(15)<module>()
-> soup = BeautifulSoup(browser.page_source, "html.parser")  # put all the html into soup
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(16)<module>()
-> labels = soup.findAll('li', {'class': "_1KuY3T row"})
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(17)<module>()
-> pdb.set_trace()
(Pdb)
> /Users/aamnasimpl/Desktop/Scraper/web Scraper.py(18)<module>()
-> print labels
(Pdb) n
[<li class="_1KuY3T row">... Sales Package ... Handset, Adapter, Earphone, User Manual ... Model Number ... J710FZDGINS ... Model Name ...]

The fact that it works while debugging suggests this is a timing issue. When you step through it in the debugger, you essentially give the page more time to load, which is why the labels print correctly.
What you need to do is make things more reliable and predictable by adding an explicit wait for at least one of the labels to be present on the page:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# ...
for url in listUrls:
browser.get(url)
# wait for labels to be present/rendered
wait = WebDriverWait(browser, 20)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li._1KuY3T.row")))
soup = BeautifulSoup(browser.page_source, "html.parser")
labels = soup.select("li._1KuY3T.row")
print(labels)
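Once the wait succeeds, the matched rows can be reduced to label/value pairs. Below is a minimal, standard-library-only sketch of that step (no bs4, so it runs anywhere); `SpecParser` is a hypothetical helper, and `SAMPLE` is a fragment copied from the printed labels in the question:

```python
from html.parser import HTMLParser

# Fragment copied from the labels printed in the question.
SAMPLE = (
    '<li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div>'
    '<ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>'
    '<li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div>'
    '<ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>'
)

class SpecParser(HTMLParser):
    """Collect (label, value) pairs from the spec rows: the label lives in a
    div with class vmXPri, the value in an inner li with class sNqDog."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._target = None  # 'label' or 'value' while inside a matching tag

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and "vmXPri" in cls:
            self._target = "label"
        elif tag == "li" and "sNqDog" in cls:
            self._target = "value"

    def handle_data(self, data):
        if self._target == "label":
            self.pairs.append([data.strip(), None])
        elif self._target == "value":
            self.pairs[-1][1] = data.strip()
        self._target = None

    def handle_endtag(self, tag):
        self._target = None

parser = SpecParser()
parser.feed(SAMPLE)
specs = {label: value for label, value in parser.pairs}
print(specs)  # {'Battery Capacity': '3300 mAh', 'Battery Type': 'Li-Ion'}
```

With bs4 already in hand, the equivalent would of course be a `select` on the same classes; the point is only that the label and value sit in sibling elements inside each `_1KuY3T` row.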
It would help if you could add the results/stack trace to the question as text rather than as an image. @TeemuRisikko Done.
Thanks @alecxe! This works. It is still not clear to me, though: if the loop works for the first URL, why did I need to add an explicit wait to make it work for the second one? @dontpanic Well, if you run the code, say, a hundred times, I bet you would see it fail on the first URL as well. The point is that the wait makes the code reliable: you don't have to make assumptions about an element being rendered at a certain point, you just explicitly wait for it. Please consider accepting the answer to resolve the topic, thanks.
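The point made in the comments, wait for a condition instead of assuming timing, is all that `WebDriverWait.until` does under the hood: it is a polling loop. A stripped-down, library-free sketch of the same idea (`wait_until` and the delayed `page_source` below are illustrative stand-ins, not selenium APIs):

```python
import time

def wait_until(condition, timeout=20.0, poll=0.5):
    """Poll condition() until it returns a truthy value or the timeout
    expires. This mirrors what WebDriverWait.until does: re-check the
    page until it is demonstrably ready, instead of hoping it already is."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Simulated "page" that only renders its labels after a short delay,
# standing in for the AJAX-loaded product page.
start = time.monotonic()
def page_source():
    if time.monotonic() - start > 0.2:
        return '<li class="_1KuY3T row">...</li>'
    return "<html></html>"

# An immediate check would see the empty page; the wait does not.
wait_until(lambda: '_1KuY3T' in page_source())
html = page_source()
```

This is also why the failure looked random: whether a bare `browser.page_source` snapshot contains the labels depends entirely on how fast that particular page happened to load.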