Python web scraping of Lowe's stores with BeautifulSoup

I'm new to scraping. I've been asked to get a list of store numbers, cities, and states from the following website: https://www.lowes.com/Lowes-Stores

Here is what I have tried so far. Since the structure has no attributes, I'm not sure how to continue with my code. Please advise.

import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df

url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')

lowes_list = soup.find_all(class_="list unstyled")
for i in lowes_list[:2]:
    print(i)

example = lowes_list[0]
example_content = example.contents
example_content

You've already found the list elements containing the links you need for each state's store lookup in your for loop. You need to get the href attribute from the 'a' tag inside each 'li' element.

That's only the first step, though, since you then need to follow those links to get the store results for each state.

Since you know the structure of these state link results, you can just do the following:

for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])
There are certainly more efficient ways to do this, but the list is very small and this works.
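
For instance, a CSS selector collapses the nested loops into a single list comprehension (a minimal sketch, assuming the state links remain inside elements with the "list unstyled" class):

# equivalent to the nested loops above, assuming the same page structure
state_links = [a["href"] for a in soup.select(".list.unstyled li a")]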

Once you have the link for each state, you can make another request per state to visit its store results page, and then grab the href attribute from the search result links on each state page, for example:

<a href="/store/AK-Anchorage/0289">Anchorage Lowe's</a>
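
Splitting an href like that on "/" yields the city and store number directly; this is the same split logic the full example below uses (a small sketch, with the href taken from the anchor above):

href = "/store/AK-Anchorage/0289"
parts = href.split("/")        # ["", "store", "AK-Anchorage", "0289"]
store_number = parts[3]        # "0289"
city = parts[2].split("-")[1]  # "Anchorage"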

I don't see the cities on the page, only the states. After clicking a state it shows all the cities under that state, and the store count should be per state? Or per city? I just need to know how many stores each state has. Thank you very much! I'm still trying to figure out how to do this...

I created a complete example that should help illustrate the different parts needed. The key components are splitting the strings apart and building a list of links, then requesting each link in that list.
import requests
from bs4 import BeautifulSoup as bs


url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}

page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this list
state_stores_links = []

# now we populate the state_stores_links array by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page, we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing, following the state links to their respective search result pages. 

# at this point we have to request a new page for each state and store the results
# you can use pandas, but a dict works too.
states_stores = {}


for link in state_stores_links:
    # splitting up the link on the / gives us the parts of the URLs.
    # by inspecting with Chrome DevTools, we can see that each state follows the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]

    # let's use the state_abbreviation as the dict's key, and we will have a stores list that we can do reporting on
    # the type and shape of this dict are irrelevant at this point. This example illustrates how to obtain the info you're after
    # in the end the states_stores[state_abbreviation]["stores"] list will contain dicts, each with a store_number and a city key
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}

    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the new one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        stores = []
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict
            # the store's city is after the state's abbreviation followed by a dash, the store number is the last thing in the link
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # by splitting the href which looks to be consistent throughout the site, we can get the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the store city is after the -, so we have to split that element up into its two parts and access the second part.
                    store_city = split_href[2].split("-")[1]
                    # creating our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # adding the dict to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )

            # let's print something so we can confirm our script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))
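
With states_stores populated, the per-state counts asked about in the comments can be read straight off the dict (a short usage sketch, using only the keys defined above):

# report how many stores were found per state, plus a grand total
total = 0
for abbreviation, state in states_stores.items():
    print("{0} ({1}): {2} stores".format(state["state_name"], abbreviation, len(state["stores"])))
    total += len(state["stores"])
print("Total stores found: {0}".format(total))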