Python: collecting links from a div class


python web-scraping


I have the following section in my code for collecting links:

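import requests
from bs4 import BeautifulSoup

def Get_Links():
    # `main` is the URL of the page to scrape, defined elsewhere
    r = requests.get(main).text
    soup = BeautifulSoup(r, 'html.parser')
    links = []
    for item in soup.findAll("a", {'class': 'ap-area-link'}):
        links.append(item.get("href"))
    return links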

The page source looks like this:

<a class="ap-area-link" href="https://www.webpage.com/product/item/">Item</a>

But my list of links comes back empty. Why?

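You can call the find method on each item. The find_all method returns a result set, which behaves much like a list, so you can use the regular bs4 methods on every item in it. Treat each item in the result set as a standalone piece of HTML.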

Try replacing:

for item in soup.findAll("div", {'class': 'large-4 medium-4 columns'}):
    links.append(item.get("href"))

with:

for item in soup.findAll("div", {'class': 'large-4 medium-4'}):
    links.append(item.find("a"))
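As a rough sketch of this suggestion, the snippet below runs find_all over a small hard-coded HTML sample (an assumption standing in for the real page) and calls find("a") on each matched div. Two details are worth noting: find returns a Tag or None, so the href still has to be read from it, and a multi-word class string is compared against the whole class attribute, so the sketch uses the full value from the sample markup.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the real page (an assumption for illustration).
html = """
<div class="large-4 medium-4 columns">
  <h5 class="show-for-small">Product Name 1</h5>
  <a href="https://webpage.com/products/item/">Item</a>
</div>
<div class="large-4 medium-4 columns">
  <h5 class="show-for-small">Product Name 2</h5>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for item in soup.find_all("div", {"class": "large-4 medium-4 columns"}):
    a = item.find("a")  # first <a> inside this div, or None if there is none
    if a is not None and a.get("href"):
        links.append(a["href"])

print(links)
```

The second div has no link, which is why the None check is there; without it, a["href"] would raise a TypeError on that item.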

Try using the adjacent sibling combinator to get the a element that follows the h5 with that class, as shown below:

links = [i['href'] for i in soup.select('h5.show-for-small + a')]

Read up on CSS selectors and combinators.
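To make the difference between combinators concrete, here is a minimal sketch comparing the adjacent sibling (+), general sibling (~), and descendant (space) combinators with soup.select. The HTML and the item-1 / item-2 URLs are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <h5> followed by two links with a <p> in between.
html = """
<div class="large-4 medium-4 columns">
  <h5 class="show-for-small">Product Name 1</h5>
  <a href="https://webpage.com/products/item-1/">Item 1</a>
  <p>Description</p>
  <a href="https://webpage.com/products/item-2/">Item 2</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "+" adjacent sibling: only the <a> that immediately follows the <h5>
adjacent = [a["href"] for a in soup.select("h5.show-for-small + a")]

# "~" general sibling: every <a> that comes after the <h5> under the same parent
siblings = [a["href"] for a in soup.select("h5.show-for-small ~ a")]

# " " descendant: every <a> anywhere inside a div carrying the "columns" class
descendants = [a["href"] for a in soup.select("div.columns a")]

print(adjacent)
print(siblings)
print(descendants)
```

With this sample, the + selector picks up only the first link, while the other two pick up both.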

You can try the following:

from bs4 import BeautifulSoup

html = """<div class="large-4 medium-4 columns">
     <h5 class="show-for-small">Product Name 1</h5>
      <a href="https://webpage.com/products/item/">Item</a>
      <h5 class="show-for-small">Product Name 2</h5>
      <a href="https://webpage.com/products/item/">Item</a>
    </div>
       """
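# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Find each matching div, then print the href of every link inside it
for item in soup.findAll("div", {'class': 'large-4 medium-4 columns'}):
    for n in item.find_all('a'):
        print('Link : ' + n.get('href'))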

Output:

Link : https://webpage.com/products/item/
Link : https://webpage.com/products/item/