Python web scraping of Lowe's stores with BeautifulSoup
I'm new to scraping. I've been asked to get a list of store numbers, cities, and states from the following website. Below is what I've tried so far. Since the structure has no attributes, I'm not sure how to continue my code. Please advise.
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df
url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_ = "list unstyled")
for i in lowes_list[:2]:
    print(i)
example = lowes_list[0]
example_content = example.contents
example_content
In your for loop you've already found the list elements that contain the links needed for the per-state store lookup. You need to get the href attribute from the "a" tag inside each "li" element. This is only the first step, because you then need to follow those links to get each state's store results. Since you know the structure of these state links, you can simply do the following:
for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])
There are certainly more efficient ways to do this, but the list is quite small and this works.
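One such shortcut is a CSS selector, which collects all the hrefs in a single pass. A minimal sketch, assuming the page markup is `<ul class="list unstyled">` items wrapping `<li><a href=...>` links (the HTML below is a hand-built stand-in, not the live page):

```python
from bs4 import BeautifulSoup

# Stand-in markup with the same structure as the state list on the page.
html = """
<ul class="list unstyled">
  <li><a href="/Lowes-Stores/Alaska/AK">Alaska</a></li>
  <li><a href="/Lowes-Stores/Alabama/AL">Alabama</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
# "ul.list.unstyled li a" matches every anchor inside a list item of a
# <ul> carrying both the "list" and "unstyled" classes.
links = [a["href"] for a in soup.select("ul.list.unstyled li a")]
print(links)  # ['/Lowes-Stores/Alaska/AK', '/Lowes-Stores/Alabama/AL']
```

This replaces the three nested loops with one list comprehension; the nested-loop version is still fine at this scale.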
Once you have each state's link, you can make another request per state to visit its store results page, then grab the href attribute from the search result links on each state page. For example:
<a href="/store/AK-Anchorage/0289">Anchorage Lowe's</a>
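A store link like the one above encodes everything you need: the path segments are `/store/<STATE>-<City>/<number>`. A small sketch of pulling them apart with `str.split` (using `maxsplit=1` on the dash so multi-word cities survive):

```python
href = "/store/AK-Anchorage/0289"
parts = href.split("/")        # ['', 'store', 'AK-Anchorage', '0289']
store_number = parts[3]
# maxsplit=1 keeps dashed city names like "Eagle-River" in one piece
state_abbr, city = parts[2].split("-", 1)
print(state_abbr, city, store_number)  # AK Anchorage 0289
```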
I can't see the cities on the page, only the states. After clicking a state it shows all the cities under it, and each store should have a store number? Or a city? I only need to know how many stores each state has. Thanks a lot! I'm still trying to work out how to do this…

I created a complete example that should help illustrate the different pieces needed. The key components are string splitting and building a list of links, then requesting each link in the list.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}

page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")
lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this list
state_stores_links = []

# populate state_stores_links by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now that we have all the list items from the page, extract each href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing: following the
# state links to their respective search result pages.
# At this point we request a new page for each state and store the results.
# You could use pandas, but a dict works too.
states_stores = {}
for link in state_stores_links:
    # Splitting the link on "/" gives us the parts of the URL.
    # Inspecting with Chrome DevTools shows that each state follows the
    # same pattern (state name and state abbreviation).
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]
    # Use the state abbreviation as the dict key, with a "stores" list we can
    # report on later. The exact type and shape of this dict is irrelevant at
    # this point; this example illustrates how to obtain the info you're after.
    # In the end, states_stores[state_abbreviation]["stores"] will contain
    # dicts, each with a store_number and a city key.
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}
    try:
        # Simple error catching in case something goes wrong, since we are
        # sending many requests. Our link is only the path half of the URL,
        # so we have to build the full one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # We now have all the stores for this state! Parse and save them
            # into our dict: the store's city comes after the state
            # abbreviation and a dash, and the store number is the last part
            # of the link.
            # Example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # the href format looks consistent throughout the site,
                    # so splitting it gives us the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the city comes after the first "-"; maxsplit=1 keeps
                    # dashed city names like "Eagle-River" intact
                    store_city = split_href[2].split("-", 1)[1]
                    # create our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # add it to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            href, e
                        )
                    )
            # print something so we can confirm the script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))
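Once the loop finishes, answering the original question ("how many stores does each state have?") is a one-line reduction over the dict. A sketch on a hand-built `states_stores` of the same shape (the entries below are illustrative, not scraped data):

```python
# Illustrative data in the same shape the scraper produces.
states_stores = {
    "AK": {"state_name": "Alaska",
           "stores": [{"city": "Anchorage", "store_number": "0289"},
                      {"city": "Wasilla", "store_number": "2512"}]},
    "AL": {"state_name": "Alabama",
           "stores": [{"city": "Mobile", "store_number": "0004"}]},
}

# Per-state store counts: one dict comprehension over the nested structure.
counts = {abbr: len(info["stores"]) for abbr, info in states_stores.items()}
print(counts)  # {'AK': 2, 'AL': 1}
```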