Scraping HTML for img src in Python bs4
I am trying to parse the following HTML with BeautifulSoup (bs4) in Python:
<div class="product w-100" data-pid="BBOMNLV1-36183" data-sid="BBOMNLWB">
<div class="product-tile w-100">
<!-- dwMarker="product" dwContentID="c4e921241579720afa4287dbf5" -->
<div class="image-container">
<a href="/pd/omn1s-low/BBOMNLV1-36183.html?dwvar_BBOMNLV1-36183_style=BBOMNLWB">
<picture>
<source type="image/jpeg" data-srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=880&hei=880 2x" srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=880&hei=880 2x"> <img class="tile-image ls-is-cached lazyloaded" src="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440" data-src="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440" data-srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&wid=440&hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&wid=880&hei=880 2x" alt="OMN1S Low" title="OMN1S Low, BBOMNLWB" itemprop="image" srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&wid=440&hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&wid=880&hei=880 2x"> </picture>
</a>
<div class="product-id d-none">BBOMNLV1-36183</div>
<div class="wishlist-url d-none">/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-WishlistItemExists</div> <span class="wishListToggle">
<a class="wishlistTile add-to-wish-list" href="/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-AddProduct" title="Wish list">
<span class="wishlist-inactive active">
<svg role="img" class="icon svg-icon " width="24" height="24" aria-label="title">
<title> </title>
<desc> </desc>
<use xlink:href="#wishlist-inactive"></use>
</svg></span> </a>
<a class="wishlistTile remove-from-wishlist" href="/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-RemoveProduct" title="Wish list"> <span class="wishlist-active ">
<svg role="img" class="icon svg-icon " width="24" height="24" aria-label="title">
<title> </title>
<desc> </desc>
<use xlink:href="#wishlist-active"></use>
</svg></span> </a>
</span>
</div>
<div class="tile-body">
<div class="row pgp-grid pb-2 pr-2">
<div class="col-12 col-lg-7 pl-2 fw-search">
<div class="pdp-link"> <a class="link font-weight-bold pname text-underline no-underline-lg" href="/pd/omn1s-low/BBOMNLV1-36183.html?dwvar_BBOMNLV1-36183_style=BBOMNLWB">OMN1S Low</a> <span class="category-name font-body w-100 d-block pt-2">
Men's Basketball
</span> </div>
</div>
<div class="col-12 col-lg-5 pl-2 fw-search justify-content-lg-end text-right d-flex p-0 search-tile">
<div class="price"> <span class="price-value">
<span class="sales font-body-large ">
$139.99
</span> </span>
</div>
</div>
</div>
<div class="pgp-reviews-wrapper" data-pageid="BBOMNLV1-36183" data-url="https://www.newbalance.com/on/demandware.store/Sites-NBUS-Site/en_US/ProductReviews-WriteReview?pid=BBOMNLV1-36183" id="BBOMNLV1-36183-pgp-reviews-wrapper-3">
<div class="p-w-r">
<section id="pr-category-snippets-BBOMNLV1-36183" class="pr-no-reviews" aria-labelledby="pr-UbCtutN-xQJECAE6zEJSy" data-testid="category-snippet">
<div class="pr-snippet pr-category-snippet">
<div class="pr-category-snippet__rating pr-category-snippet__item">
<div class="pr-snippet-stars pr-snippet-stars-png ">
<div aria-hidden="true" class="pr-rating-stars">
<div class="pr-star-v4 pr-star-v4-0-filled"></div>
<div class="pr-star-v4 pr-star-v4-0-filled"></div>
<div class="pr-star-v4 pr-star-v4-0-filled"></div>
<div class="pr-star-v4 pr-star-v4-0-filled"></div>
<div class="pr-star-v4 pr-star-v4-0-filled"></div>
</div>
<div aria-hidden="true" class="pr-snippet-rating-decimal">0.0</div>
</div><span id="pr-UbCtutN-xQJECAE6zEJSy" class="pr-accessible-text">Rated 0 out of 5 stars</span></div>
<div class="pr-category-snippet__total pr-category-snippet__item">No Reviews</div>
</div>
</section>
</div>
</div>
</div>
<div class="badges"> <span class="sub-badges p-1 text-uppercase font-weight-bold">NEW</span> </div>
<!-- END_dwmarker -->
</div>
</div>
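How can I fix my code to retrieve the image link?
The attribute you want appears to be the src attribute of the img tag; that is where the link is. By the way, pass the long HTML you provided as the html parameter.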
from bs4 import BeautifulSoup

def queryNewBalance(html):
    # r = requests.get('https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null')
    soup = BeautifulSoup(html, 'html.parser')
    # Each product tile is a <div class="product w-100">
    result = soup.find_all('div', class_='product w-100')
    for res in result:
        print("*******************************")
        print(res.find('img', class_='tile-image ls-is-cached lazyloaded')['src'])  # Picture
        print("*******************************")
    print(f"\nFound total shoes: {len(result)}")

# html is the long HTML string from the question
queryNewBalance(html)
Output:
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
Found total shoes: 1
[Finished in 0.7s]
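A side note on the class lookup above: when class_ is given a multi-word string, BeautifulSoup compares it against the exact value of the class attribute, so the match depends on class order and on no extra classes being present. The sketch below matches on the single class tile-image through a CSS selector instead. The function name extract_image_urls and the sample markup are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_image_urls(html):
    """Return the src of the tile image in every product tile (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tile in soup.select("div.product"):
        # "img.tile-image" matches regardless of class order or extra classes.
        img = tile.select_one("img.tile-image")
        if img is not None and img.has_attr("src"):
            urls.append(img["src"])
    return urls


# Made-up sample markup: class order differs between tiles, one tile has no image.
sample = """
<div class="product w-100"><img class="lazyloaded tile-image" src="https://example.com/a.jpg"></div>
<div class="product w-100"><img class="tile-image" src="https://example.com/b.jpg"></div>
<div class="product w-100"><span>no image</span></div>
"""

print(extract_image_urls(sample))
```

Tiles without a matching image are skipped rather than raising a TypeError, which is what indexing the None returned by find() would do.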
---URL---
Output:
*******************************
https://nb.scene7.com/is/image/NB/bbomnxbb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlpl_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlbr_nb_02_i_5a34b3da900d437a9a88?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlfc_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwt_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
Found total shoes: 6
[Finished in 2.9s]
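P.S.:
If you get more involved in web scraping and scrape a lot of websites, especially large ones, I suggest switching the parser to html5lib (pip install html5lib). In my experience it is the more robust parser: I have had problems with html.parser where it simply did not pick up some parts of a site, even though I checked where they should have been in the soup object. Anyway, your call. Good luck!

The HTML that the server returns contains no img with the class tile-image ls-is-cached lazyloaded; the lazy-loading classes are presumably added by JavaScript in the browser. To get the image links, you can use the CSS selector img[itemprop='image']: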
import requests
from bs4 import BeautifulSoup

def queryNewBalance(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    result = soup.find_all("div", class_="product w-100")
    for res in result:
        print("*******************************")
        # data-src holds the image URL in the raw (not yet lazy-loaded) HTML
        print(res.select_one("img[itemprop='image']")["data-src"])
    print(f"\nFound total shoes: {len(result)}")

queryNewBalance(
    "https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null"
)
Output:
*******************************
https://nb.scene7.com/is/image/NB/bbomnxbb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlpl_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlbr_nb_02_i_5a34b3da900d437a9a88?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlfc_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwt_nb_02_i?$pdpflexf2$&wid=440&hei=440
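The img tag in the question also carries data-srcset and srcset attributes listing 1x and 2x candidates. If a larger image is wanted, a minimal sketch along the following lines could pick a candidate by its descriptor and fall back to data-src or src. The helper names parse_srcset and image_url and the sample markup are hypothetical. Splitting on commas is an assumption that holds for the URLs shown in the question but would break on URLs that themselves contain commas.

```python
from bs4 import BeautifulSoup


def parse_srcset(srcset):
    """Split a srcset string into (url, descriptor) pairs.

    Assumes the URLs contain no commas, as in the question's markup.
    """
    pairs = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        descriptor = parts[1] if len(parts) > 1 else "1x"
        pairs.append((parts[0], descriptor))
    return pairs


def image_url(img, want="2x"):
    """Prefer the `want` candidate from data-srcset/srcset, else data-src, else src."""
    srcset = img.get("data-srcset") or img.get("srcset")
    if srcset:
        for url, descriptor in parse_srcset(srcset):
            if descriptor == want:
                return url
    return img.get("data-src") or img.get("src")


# Made-up sample markup modelled on the tile in the question.
sample = (
    '<div class="product w-100">'
    '<img itemprop="image" data-src="https://example.com/a.jpg?wid=440" '
    'data-srcset="https://example.com/a.jpg?wid=440 1x, https://example.com/a.jpg?wid=880 2x">'
    "</div>"
)

soup = BeautifulSoup(sample, "html.parser")
img = soup.select_one("img[itemprop='image']")
print(image_url(img))        # the 2x candidate
print(image_url(img, "3x"))  # no 3x candidate, so data-src is used
```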
Hmmm.. you are doing import bs4; I think it should be from bs4 import BeautifulSoup, or did you just not paste it here?
@IceBear Edited.
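Regarding the html5lib suggestion in the P.S. of the first answer: BeautifulSoup raises bs4.FeatureNotFound when the requested parser is not installed, so one possible way to try html5lib without making it a hard dependency is the sketch below. The function name make_soup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="html5lib"):
    """Parse with the preferred parser, falling back to the built-in html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(html, "html.parser")


soup = make_soup('<div class="product w-100"><img itemprop="image" data-src="x.jpg"></div>')
print(soup.select_one("img[itemprop='image']")["data-src"])
```

Note that the two parsers can build slightly different trees for malformed markup, so selectors are worth re-checking after switching.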