Python BeautifulSoup AttributeError
I am trying to use Python BeautifulSoup to get some image URLs from HTML content. My HTML content:
<div id="photos" class="tab rel-photos multiple-photos">
<span id="watch-this" class="classified-detail-buttons">
<span id="c_id_10832265:c_type_202:watch_this">
<a href="/watchlist/classified/baby-items/10832265/1/" id="watch_this_logged" data-require-auth="favoriteAd" data-tr-event-name="dpv-add-to-favourites">
<i class="fa fa-fw fa-star-o"></i></a></span>
</span>
<span id="thumb1" class=" image">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main"
id="a-photo-modal-view:263986810"
rel="photos-modal"
target="_new"
onClick="return dbzglobal_event_adapter(this);">
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main);"></div>
</a>
</span>
<ul id="thumbs-list">
<li>
<span id="thumb2" class="image2">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main" id="a-photo-modal-view:263986811" rel="photos-modal" target="_new" onClick="return dbzglobal_event_adapter(this);" >
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=thumb_retina);"></div>
</a>
</span>
</li>
<li id="thumbnails-info">
4 Photos
</li>
</ul>
<div id="photo-count">
4 Photos - Click to enlarge
</div>
</div>
But I get an error:
Traceback (most recent call last):
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/SCRAPE/boats.py", line 47, in <module>
images = soup.find("div", {"id": ["photos"]}).find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
How can I get only the URLs from the href attributes?

Your code works fine for me (assuming your HTML is in `html_doc`):
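As a minimal sketch of that check, assuming `html_doc` holds a shortened version of the HTML from the question (the `example.com` URLs are placeholders, not the real image links), the same `find(...).find_all("a")` call returns the anchors without raising:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question; URLs are placeholders.
html_doc = """
<div id="photos" class="tab rel-photos multiple-photos">
  <span id="thumb1" class=" image">
    <a href="https://example.com/image1;p=main" rel="photos-modal"></a>
  </span>
  <ul id="thumbs-list">
    <li><span id="thumb2" class="image2">
      <a href="https://example.com/image2;p=main" rel="photos-modal"></a>
    </span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Same lookup as in the question: works because div#photos is present.
images = soup.find("div", {"id": ["photos"]}).find_all("a")

# Read the href attribute of each anchor.
hrefs = [a["href"] for a in images]
print(hrefs)
```

So the lookup itself is not the problem, as long as the parsed document really contains `div#photos`.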
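However, your problem is that the request to the URL returns text that does not match the HTML sample you gave. Although you tried to supply a random user agent, the server returns: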
<li>You\'re a power user moving through this website with super-human speed.</li>\n <li>You\'ve disabled JavaScript in your web browser.</li>\n <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title=\'Third party browser plugins that block javascript\' href=\'http://ds.tl/help-third-party-plugins\' target=\'_blank\'>support article</a>.</li>\n </ul>\n </div>\n <p class="we-could-be-wrong" >\n We could be wrong, and sorry about that! Please complete the CAPTCHA below and we’ll get you back on dubizzle right away.
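This also explains the `AttributeError`: `soup.find(...)` returns `None` when the page has no `div#photos`, and calling `.find_all` on `None` fails. A minimal sketch of a guard, assuming a hypothetical helper name `extract_photo_links` and a shortened stand-in for the CAPTCHA page:

```python
from bs4 import BeautifulSoup


def extract_photo_links(html):
    """Return hrefs inside div#photos, or an empty list if the div is missing."""
    soup = BeautifulSoup(html, "html.parser")
    photos = soup.find("div", id="photos")
    if photos is None:
        # The server sent something else (for example a CAPTCHA page).
        return []
    return [a["href"] for a in photos.find_all("a", href=True)]


# Shortened stand-in for the blocked response shown above.
blocked_page = '<p class="we-could-be-wrong">Please complete the CAPTCHA below</p>'

# Placeholder page that does contain the expected container.
photo_page = '<div id="photos"><a href="https://example.com/image1"></a></div>'

print(extract_photo_links(blocked_page))
print(extract_photo_links(photo_page))
```

The guard only makes the failure explicit; it does not get past the CAPTCHA.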
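Since the CAPTCHA is there to prevent scraping, I suggest respecting the administrators' wishes and not scraping the site. Perhaps there is an API?

Try this: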
for item in soup.find_all('span'):
    try:
        # First anchor with an href inside this span
        link = item.find_all('a', href=True)[0].attrs.get('href', None)
    except IndexError:
        # This span contains no anchor with an href
        continue
    else:
        print(link)
Output:
/watchlist/classified/baby-items/10832265/1/
/watchlist/classified/baby-items/10832265/1/
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main
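Comments:
- page = requests.get(url, headers={'user-agent': user_agent.random}); soup = BeautifulSoup(page.text, 'html.parser'); url = ""
- So that means there is no way to do this? What do you think about faking the IP? Have they blacklisted my IP?
- Same error. Traceback (most recent call last): File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/SCRAP/boats.py", line 48, in <module> soup.find("div", {"id": ["photos"]}).find_all("a"): AttributeError: 'NoneType' object has no attribute 'find_all'
- I changed my answer, try it. Otherwise please send the URL, because I cannot reproduce your error from the HTML in the question.
- url = ""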