Python: can't collect href in a div using bs4


python web-scraping


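I'm a beginner trying to scrape this website with bs4. I want to collect the hrefs from a specific div, then follow those hrefs to the product pages and collect data, but I'm stuck at collecting the hrefs. I'd be very glad if someone could help me with this: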

import urllib.request
from bs4 import BeautifulSoup

urlpage = 'https://www.digikala.com/search/category-tire/' 
print(urlpage)

# scrape the webpage using beautifulsoup

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# find product items
results = soup.find_all('div', attrs={'class': 'c-product-box__title'})
print('BeautifulSoup - Number of results', len(results))

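This is the first result. Printing results gives 36 divs; I only copied the first one. I did my best to find the answer without asking, but I didn't even get close, so I apologize if this turns out to be simple.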

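# -*- coding: utf-8 -*-
html_doc = """ """  # the <div> markup from the question goes here
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# class_ (with a trailing underscore) is the bs4 keyword for filtering by CSS class
for div in soup.find_all('div', class_='c-product-box__title'):
    print div.a['href']  # Python 2 print statement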

Output:

$ python a.py
/product/dkp-539563/لاستیک-خودرو-میشلن-مدل-primacy-3-سایز-20555r16-دو-حلقه


For each resulting div, first get its child <a> element, then get the value of that element's href attribute, as follows:

results = soup.find_all('div', attrs={'class': 'c-product-box__title'})
print('BeautifulSoup - Number of results', len(results))

links = []
for result in results:
    links.append(result.a['href'])

print(links)

This produces a list of 36 links. Here are the first 2 as a sample:

['/product/dkp-539563/لاستیک-خودرو-میشلن-مدل-primacy-3-سایز-20555r16-دو-حلقه',
'/product/dkp-959932/لاستیک-خودرو-گلدستون-مدل-2020-2000-سایز-1856514-دو-حلقه-مناسب-برای-انواع-رینگ-14',
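The hrefs collected this way are site-relative paths, so they have to be joined with the site's base address before the product pages can be requested. A minimal sketch using the standard library's urllib.parse.urljoin; the base address comes from the URL in the question, while the helper name to_absolute and the shortened sample path are made up for illustration:

```python
from urllib.parse import urljoin

# Base address taken from the URL in the question.
BASE_URL = 'https://www.digikala.com/'


def to_absolute(hrefs, base=BASE_URL):
    """Return absolute URLs for a list of hrefs (hypothetical helper)."""
    # urljoin leaves already-absolute URLs untouched and resolves
    # site-relative paths such as '/product/...' against the base.
    return [urljoin(base, href) for href in hrefs]


# Illustrative path, shortened from the kind of value shown in the list above.
sample_links = ['/product/dkp-539563/']
absolute_links = to_absolute(sample_links)
print(absolute_links)
```

Each absolute URL can then be fetched and parsed the same way as the search page.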

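You can combine a class selector and a type selector with the child combinator to get the <a> tags that are children of the div (the div being identified by the class selector). In this case there is no need to limit which children are returned.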

import requests
from bs4 import BeautifulSoup 

url = 'https://www.digikala.com/search/category-tire/'
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
links = [link['href'] for link in soup.select('.c-product-box__title > a')]
print(len(links))
print(links[0])
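To compare the loop over find_all with the CSS selector without hitting the live site, both can be run against a small inline document. This is only a sketch: the markup below is made up to imitate the structure described in the question, and the real page's HTML may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the structure described in the question;
# the real page's HTML may differ.
html_doc = """
<div class="c-product-box__title"><a href="/product/dkp-1/">Tire A</a></div>
<div class="c-product-box__title"><a href="/product/dkp-2/">Tire B</a></div>
<div class="c-other"><a href="/ignored/">Not a product</a></div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Approach 1: find the divs by class, then read the first <a> inside each one.
loop_links = []
for div in soup.find_all('div', class_='c-product-box__title'):
    loop_links.append(div.a['href'])

# Approach 2: CSS child combinator, selecting <a> tags that are direct children.
css_links = [a['href'] for a in soup.select('.c-product-box__title > a')]

print(loop_links)
print(css_links)
```

On markup shaped like this the two lists are identical. One difference to keep in mind: div.a returns the first <a> anywhere inside the div, while the '> a' selector matches only direct children and returns every one of them.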

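This does not answer the question.

Thank you very much, man, this works too, but I'm wondering which method I should use, yours or QHar's? By the way, the last line should be print(div.a['href']), since print differs between Python 2 and Python 3.

Thank you very much for your help. I'd like to know which answer is the better approach, and why?

This method should be faster because it uses CSS selectors, which I would expect to be optimized for this. Also, I prefer the list comprehension above to a loop.

Thanks for your answer, it works. Is there a reason for me to use this answer over the others?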