Python - Web Scraping Walmart Category Names

I want to get the department names from this Walmart page. As you can see, at first there are 7 departments on the left under Departments (Chocolate Cookies, Cookies, Butter Cookies, etc.). When I click "See all Departments", 9 more categories are added, so the number is now 16. I am trying to collect all 16 departments automatically. I wrote this code:
from selenium import webdriver
n_links = []
driver = webdriver.Chrome(executable_path='D:/Desktop/demo/chromedriver.exe')
url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
driver.get(url)
search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()
search2 = driver.find_element_by_xpath("//*[@id='Departments']/div/div/div/div").text
sep = search.split('\n')
sep2 = search2.split('\n')
lngth = len(sep)
lngth2 = len(sep2)
for i in range(1, lngth):
    path = "//*[@id='Departments']/div/div/ul/li" + "[" + str(i) + "]/a"
    nav_links = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links)
for i in range(1, lngth2):
    path = "//*[@id='Departments']/div/div/div/div/ul/li" + "[" + str(i) + "]/a"
    nav_links2 = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links2)
print(n_links)
print(len(n_links))
In the end, when I run the code, I can see the links in the n_links array. The problem is that sometimes there are 13 links and sometimes 14. It should be 16, but I have never seen 16, only 13 or 14. I tried adding time.sleep(3) before the search2 line, but it didn't help. Can you help me?

I think you are overcomplicating this. You are right that after clicking the button you may need to wait for the departments to load:
# This gets all the departments currently shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")
# Click the "See all Departments" button
driver.find_element_by_xpath("//button[@data-automation-id='button']//span[contains(text(),'all Departments')]").click()
# Get the departments shown after expanding
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")
# Iterate through the departments and print their labels
for d in departments:
    print(d.text)
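Separately from timing, the question's own loops drop one item each: range(1, lngth) stops at lngth - 1, while the XPath indices li[1] through li[lngth] are 1-based. With 7 and 9 departments, that alone caps the result at 6 + 8 = 14 links, which matches the counts observed. A quick check without a browser:

```python
# Simulate the text the question's code extracts from the department list
search = ("Chocolate Cookies\nCookies\nButter Cookies\nShortbread Cookies\n"
          "Coconut Cookies\nHealthy Cookies\nKeebler Cookies")
sep = search.split('\n')
lngth = len(sep)                       # 7 visible departments

# range(1, lngth) yields 1..lngth-1: only 6 of the 7 XPath indices
visited = list(range(1, lngth))
print(len(visited))                    # 6 - the last li is never read

# range(1, lngth + 1) covers every 1-based index li[1]..li[7]
visited_all = list(range(1, lngth + 1))
print(len(visited_all))                # 7
```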
To print all 16, you can try searching for them with the CSS selectors .collapsible-content > ul a and .sometimes-shown a.

In your example:
from selenium import webdriver
driver = webdriver.Chrome()
url = (
    "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
)
driver.get(url)
search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()
all_departments = [
    link.get_attribute("href")
    for link in driver.find_elements_by_css_selector(
        ".collapsible-content > ul a, .sometimes-shown a"
    )
]
print(len(all_departments))
print(all_departments)
Output:
16
['https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', 'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', 'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', 'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', 'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', 'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', 'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', 'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', 'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', 'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', 'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', 'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', 'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', 'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', 'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', 'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']
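The comma in that selector is CSS grouping: elements matching either .collapsible-content > ul a or .sometimes-shown a are returned, in document order. A minimal sketch with BeautifulSoup; the HTML snippet is invented for illustration, and the real page's markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the two containers the selector targets
html = """
<div class="collapsible-content">
  <ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>
</div>
<div class="sometimes-shown">
  <a href="/c">C</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# The comma groups two selectors; matches from both are returned together
links = [a["href"] for a in soup.select(".collapsible-content > ul a, .sometimes-shown a")]
print(links)  # ['/a', '/b', '/c']
```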
Using only BeautifulSoup:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Language": "en-US,en;q=0.5",
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.select_one("#searchContent").contents[0])
# uncomment to see all data:
# print(json.dumps(data, indent=4))
def find_departments(data):
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)

departments = next(find_departments(data), {})
for d in departments.get("values", []):
    print(
        "{:<30} {}".format(
            d["name"], "https://www.walmart.com" + d["baseSeoURL"]
        )
    )
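The find_departments helper is a generic depth-first walk over nested dicts and lists, yielding every dict whose "name" equals "Departments". It can be exercised on made-up data; the field names below mirror the real payload, but the values are invented:

```python
def find_departments(data):
    # Depth-first walk over nested dicts/lists, yielding matching dicts
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)

# Invented sample mirroring the shape of the embedded JSON
sample = {
    "facets": [
        {"name": "Brand", "values": []},
        {"name": "Departments", "values": [
            {"name": "Cookies", "baseSeoURL": "/browse/food/cookies"},
        ]},
    ]
}
departments = next(find_departments(sample), {})
print([d["name"] for d in departments.get("values", [])])  # ['Cookies']
```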
Why not wait for visibility of all the elements:
texts = []
links = []
driver.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
wait = WebDriverWait(driver, 60)
wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='See all Departments']/parent::button"))).click()
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.department-single-level a")))
for element in elements:
    # to get text
    texts.append(element.text)
    # to get link by attribute name
    links.append(element.get_attribute('href'))
print(texts)
print(links)
Console output:
[u'Chocolate Cookies', u'Cookies', u'Butter Cookies', u'Shortbread Cookies', u'Coconut Cookies', u'Healthy Cookies', u'Keebler Cookies', u'Biscotti', u'Gluten-Free Cookies', u'Molasses Cookies', u'Peanut Butter Cookies', u'Pepperidge Farm Cookies', u'Snickerdoodle Cookies', u'Sugar-Free Cookies', u"Tate's Cookies", u'Vegan Cookies']
[u'https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', u'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', u'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', u'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', u'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', u'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', u'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', u'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', u'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', u'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', u'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', u'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', u'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', u'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', u'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', u'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']
The following imports are needed:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC