Python can't exhaust the content of all the identical URLs used within my scraper


I've written a scraper in Python that uses the BeautifulSoup library to parse the names spread across different pages of a website. I could manage it if it weren't for the fact that the URLs paginate differently: some URLs have pagination and some don't, because they hold very little content.

My question is: how can I handle them all within a single function, whether or not they have pagination?

My initial attempt (it can only parse the content of each URL's first page):
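(The original snippet is not preserved in the post; the following is only a hypothetical reconstruction of such an attempt, which fetches each base URL once and therefore misses every paginated result.)

from bs4 import BeautifulSoup
import requests

urls = [
    "https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all",
    "https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all",
]

# Each URL is requested exactly once, so only the first page of results is parsed
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for content in soup.select("td[class='table-row-price']"):
        print(content.select_one("h2 a").text)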

If there were a single URL with pagination like the one below, I could have done the whole thing:

from bs4 import BeautifulSoup
import requests

page_no = 0
page_link = "https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all/page/{}"

while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select("td[class='table-row-price']")
    if len(container) <= 1: break  # stop once a page returns (almost) no listings

    for content in container:
        title = content.select_one("h2 a").text
        print(title)
This solution tries to find pagination anchor (a) tags. If any pagination is found, all pages are scraped as you iterate over an instance of the PageScraper class. If not, only the first result (a single page) is crawled:

import requests
from bs4 import BeautifulSoup as soup
import contextlib

def has_pagination(f):
    # Decorator: refuse to iterate when the scraped page has no pagination links
    def wrapper(cls):
        if not cls._pages:
            raise ValueError('No pagination found')
        return f(cls)
    return wrapper

class PageScraper:
    def __init__(self, url: str):
        self.url = url
        self._home_page = requests.get(self.url).text
        # Collect the page numbers from the pagination block, dropping the trailing "next" link
        pagination = soup(self._home_page, 'html.parser').find('div', {'class': 'pagination'})
        self._pages = [i.text for i in pagination.find_all('a')][:-1] if pagination else []

    @property
    def first_page(self):
        return [i.find('h2', {'class': 'table-row-heading'}).text
                for i in soup(self._home_page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @has_pagination
    def __iter__(self):
        for p in self._pages:
            _link = requests.get(f'{self.url}/page/{p}').text
            yield [i.find('h2', {'class': 'table-row-heading'}).text
                   for i in soup(_link, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @classmethod
    @contextlib.contextmanager
    def feed_link(cls, link):
        results = cls(link)
        try:
            # Pagination found: yield the first page together with every paginated page
            # (a context manager may only yield once, so the pages are bundled into one list)
            yield [results.first_page, *results]
        except ValueError:
            # No pagination: only the first page of results is available
            yield results.first_page
The class constructor finds any pagination, and only if pagination links are found will the __iter__ method gather all the pages. For example, the Rhode Island URL used below has no pagination, so:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all')
pages = [i for i in r]
ValueError: No pagination found

However, the content of the first page can still be retrieved:

print(r.first_page)
['Forest Park MHP', 'Gansett Mobile Home Park', 'Meadowlark Park', 'Indian Cedar Mobile Homes Inc', 'Sherwood Valley Adult Mobile', 'Tripp Mobile Home Park', 'Ramblewood Estates', 'Countryside Trailer Park', 'Village At Wordens Pond', 'Greenwich West Inc', 'Dadson Mobile Home Estates', "Oliveira's Garage", 'Tuckertown Village Clubhouse', 'Westwood Estates']
For a URL with full pagination, however, all of the resulting pages can be scraped:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
d = [i for i in r]
PageScraper.feed_link performs this check automatically, yielding the first page and all subsequent results if pagination is found, or only the first page if no pagination exists:

urls = {'https://www.mobilehome.net/mobile-home-park-directory/maine/all', 'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all', 'https://www.mobilehome.net/mobile-home-park-directory/vermont/all', 'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all'}
for url in urls:
   with PageScraper.feed_link(url) as r:
      print(r)
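Note that, with the feed_link fix above, the yielded value is a flat list of names when there is no pagination and a list of per-page lists when there is. A small illustrative helper (not part of the original answer; the name flatten is my own) can normalize both shapes before printing:

def flatten(result):
    # result is either a list of names (single page) or a list of per-page name lists
    if result and isinstance(result[0], list):
        return [name for page in result for name in page]
    return result

for url in urls:
    with PageScraper.feed_link(url) as r:
        print(flatten(r))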

It seems I've found a robust solution to this problem. The approach is iterative: it first checks whether the current page exposes a next-page URL. If it finds one, it follows that URL and repeats the process. When a link has no further pagination, the scraper breaks out and moves on to the next link.

Here it is:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

urls = [
        'https://www.mobilehome.net/mobile-home-park-directory/alaska/all',
        'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
        'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
        'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
    ]

def get_names(link):
    while True:
        r = requests.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        for items in soup.select("td[class='table-row-price']"):
            name = items.select_one("h2 a").text
            print(name)

        nextpage = soup.select_one(".pagination a.next_page")

        if not nextpage:break  #If no pagination url is there, it will break and try another link

        link = urljoin(link,nextpage.get("href"))

if __name__ == '__main__':
    for url in urls:
        get_names(url)
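As a small variation (not from the original post), the same next-page-following loop can be written as a generator that yields names instead of printing them, reusing the imports and the urls list above; this makes it easy to collect every name into a single list:

def iter_names(link):
    # Same traversal as get_names above, but yields each name instead of printing it
    while True:
        soup = BeautifulSoup(requests.get(link).text, "lxml")
        for item in soup.select("td[class='table-row-price']"):
            yield item.select_one("h2 a").text
        nextpage = soup.select_one(".pagination a.next_page")
        if not nextpage:  # no further pagination for this URL
            break
        link = urljoin(link, nextpage.get("href"))

all_names = [name for url in urls for name in iter_names(url)]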

Nice implementation, @Ajax1234. One question about your approach above, though: what is @contextlib.contextmanager doing here? I mean, is it necessary? Your post deserves an upvote.

@asmitu Context managers have a variety of uses. More broadly, though, a context manager follows a setup-then-teardown sequence, which can be a useful design pattern. It isn't strictly necessary in this case, but it is slightly cleaner than a decorator and simpler to implement.
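To illustrate the setup-then-teardown pattern mentioned in that comment, here is a minimal sketch (not from the original answer; the session handling is purely hypothetical) that wraps a requests.Session in a context manager so it is always closed once scraping is done:

import contextlib
import requests

@contextlib.contextmanager
def scraping_session():
    # Setup: create a shared session (connection pooling, common headers)
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    try:
        yield session  # the body of the with-block runs here
    finally:
        # Teardown: always close the session, even if scraping raised an exception
        session.close()

# Usage: the session is guaranteed to be cleaned up when the block exits
with scraping_session() as s:
    res = s.get("https://www.mobilehome.net/mobile-home-park-directory/maine/all")
    print(res.status_code)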