How do I move to a new page when web scraping with BeautifulSoup in Python?


Below is the code that pulls records from craigslist. Everything works fine, but I need to be able to move on to the next set of records and repeat the same process, and as a programming novice I'm stuck. From the page source it looks like I should keep following the arrow button contained in the span here until it no longer contains an href:

<a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a> 

This isn't a direct answer to how to access the "next" button, but it may be a solution to your problem. When I've scraped websites in the past, I've used each page's URL to loop through the search results. On craigslist, the URL changes when you click "next page", and that change usually follows a pattern you can exploit. I didn't have to look for long: the second and third result pages differ only in the last part of the URL, which increases by 120 each time. You can create a list of multiples of 120 and build a for loop that appends each value to the end of the URL.
Your current for loop would then be nested inside this new for loop.
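
A minimal sketch of that idea, assuming the result pages follow the ?s= offset pattern shown in the anchor above; the 120-results-per-page step and the five pages fetched are assumptions you would adjust:

import requests
from bs4 import BeautifulSoup

base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'

# Assumed: each results page holds 120 listings and we only want the first 5 pages.
offsets = [i * 120 for i in range(5)]  # [0, 120, 240, 360, 480]

for offset in offsets:
    # The first page is the bare URL; later pages append the ?s= offset.
    url = base_url if offset == 0 else f'{base_url}?s={offset}'
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    # ... your existing loop over soup.find_all('li', class_="result-row") goes here ...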

Alternatively, for each page you crawl you can find the next URL to crawl and add it to a list.

Here is how I would do it without changing your code too much. I've added some comments so you can follow what's going on, but leave me a comment if you need any extra explanation:

import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup


base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'
urls = []
urls.append(base_url)
dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0: # while we have urls to crawl
    print(urls)
    url = urls.pop(0) # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    next_url = soup.find('a', class_= "button next") # finds the next-page link, if there is one
    if next_url: # only follow it if a next-page link was actually found
        urls.append(base_search_url + next_url['href']) # adds next url to crawl to the list of urls to crawl

    listings = soup.find_all('li', class_= "result-row") # get all current url listings
    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)

        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except AttributeError: # find() returned None -- no price listed
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except AttributeError: # find() returned None -- no neighborhood listed
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

# write to a file
listings_df.to_csv("craigslist_listings.csv")
Edit: You also forgot to import BeautifulSoup in your code; I've added it in my answer.
Edit2: You only need the first instance of the "next" button, because a page can (and in this case does) have more than one "next" button.
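
A quick illustration of that point; the two-anchor HTML below is just a made-up stand-in for a results page with pagination links at both the top and the bottom:

from bs4 import BeautifulSoup

html = '''
<a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a>
<ul><li class="result-row">...listings...</li></ul>
<a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a>
'''
soup = BeautifulSoup(html, 'lxml')

print(len(soup.find_all('a', class_="button next")))  # 2 -- both buttons are matched
print(soup.find('a', class_="button next")['href'])   # only the first match: /search/syp?s=120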

Edit3: To crawl computer parts, base_url should be changed to the one used in this code.

The request for each page should happen inside a loop. You need to make sure find_all returns at most one element for the next-page anchor each time; the next iteration of the loop should then request the URL taken from that element's href attribute. Terminate the loop when the element can no longer be found or its href attribute is an empty string (see the sketch after the comments below).

That's the approach I had considered, but I figured that if the query ends up spanning, say, 100 pages, I might not want to build a list up front; although if the increment is consistent, I suppose I could just step by 120 until no results come back.

Great! It took some time to work through the code, but I'm pretty sure I understand everything that's going on. Thank you so much.
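
For completeness, a minimal sketch of that loop, assuming the next-page anchor keeps the "button next" class and that a missing anchor or an empty href marks the last page:

import requests
from bs4 import BeautifulSoup

base_search_url = 'https://nh.craigslist.org'
url = 'https://nh.craigslist.org/d/computer-parts/search/syp'

while url:
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    # ... process soup.find_all('li', class_="result-row") here, as in the code above ...

    next_link = soup.find('a', class_="button next")  # at most one anchor is used
    if next_link and next_link.get('href'):           # stop when the anchor is missing or its href is empty
        url = base_search_url + next_link['href']
    else:
        url = None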