Python BeautifulSoup: scraping nested child divs with find_all


python html web-scraping


I am trying to scrape the nested div instances under the class "product-tech-section-row" from the following HTML:

<h2 class="product-tech-section-title">
    Présentation de la TV SAMSUNG UE49MU9005</h2>

<div class="product-tech-section-row">
    <div>
        Désignation</b> :
    </div>
    <div>
        <b>SAMSUNG UE49MU9005</b> (UE 49MU9005 TXXC)<br><br>Plus d'informations sur les <a             href="http://www.lcd-compare.com/info-tv-led-samsung.htm" title="TV Samsung : informations et statistiques">TV LED Samsung</a><br><a href="http://www.lcd-compare.com/tv-liste-122.htm?tv_label=7,8" title="Liste des TV 4K">Voir les TV 4K (Ultra HD ou Quad HD)</a></div>
</div>


<div class="product-tech-section-row">
    <div>
        Date de sortie (approx.)</b> :
    </div>
    <div>
        Mars 2017</div>
</div>



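However, find_all() only extracts the first child div (only "Désignation"; "SAMSUNG UE…" does not appear), as shown in the code below. Am I missing something? Any help is much appreciated.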

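from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# Allowing access to the website (personal use)
prod_url = "http://www.lcd-compare.com/televiseur-SAMUE49MU9005-SAMSUNG-UE49MU9005.htm"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(prod_url, headers=hdr)
prod_html = uReq(req)

# Build the soup. This step was missing from the snippet as posted;
# the parser name "html.parser" is an assumption.
prod_soup = soup(prod_html.read(), "html.parser")
prod_html.close()

# Parsing the technical details (attrs is a dict, so use ':' rather than ',')
tec_list = prod_soup.find_all("div", {"class": "product-tech-section-row"})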

---------------------------------------------------------------------------------------
#However, this is what I am getting:
>>>print(tec_list[0])
<div class="product-tech-section-row">
<div>
Désignation</div></div>

>>>print(tec_list[0].findChildren())
[<div>
 Désignation</div>]


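I think the reason you cannot scrape the nested elements is that the website you are accessing relies heavily on JavaScript rendering.

I verified this with Selenium, and I was able to parse the nested elements without any problems.

Code:

Output: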

<div class="product-tech-section-row">
<div>
Désignation :
</div>
<div>
<b>SAMSUNG UE49MU9005</b> (UE 49MU9005 TXXC)<br/><br/>Plus d'informations sur les <a data-hasqtip="139" href="http://www.lcd-compare.com/info-tv-led-samsung.htm" oldtitle="TV Samsung : informations et statistiques" title="">TV LED Samsung</a><br/><a data-hasqtip="141" href="http://www.lcd-compare.com/tv-liste-122.htm?tv_label=7,8" oldtitle="Liste des TV 4K" title="">Voir les TV 4K (Ultra HD ou Quad HD)</a></div>
</div>
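
The approach described in this answer (let a browser build the page, then hand the resulting HTML to BeautifulSoup) can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: it assumes Selenium and a matching Chrome driver are installed, it uses headless Chrome, and the helper names `rows_to_pairs` and `fetch_rendered_html` are hypothetical.

```python
from bs4 import BeautifulSoup


def rows_to_pairs(html):
    """Return (label, value) text pairs, one per product-tech-section-row."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("div", class_="product-tech-section-row"):
        # Direct child <div>s only: the first holds the label, the second the value.
        cells = row.find_all("div", recursive=False)
        if len(cells) >= 2:
            label = cells[0].get_text(" ", strip=True).rstrip(" :")
            value = cells[1].get_text(" ", strip=True)
            pairs.append((label, value))
    return pairs


def fetch_rendered_html(url):
    """Open the URL in headless Chrome and return the browser's page source."""
    # Imported here so rows_to_pairs() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Usage (needs Selenium, Chrome and network access):
# pairs = rows_to_pairs(fetch_rendered_html(prod_url))
```

Keeping the parsing step separate from the browser step means the same `rows_to_pairs` helper can be tried on HTML obtained in any other way, for example from `urlopen`.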



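Try print(tec_list[1]); this will give you the "SAMSUNG UE49MU9005" result. Keep in mind that find_all() returns a list containing the scraped elements, which is stored in tec_list.

Thanks for your reply. Unfortunately, print(tec_list[1]) only returns "Date de sortie (approx.)", i.e. the next "product-tech-section-row" div.

Hi p404, please check the answer below.

Thanks! Your suggestion works very well. By the way, do you know of any other library that can do the same job without involving a browser? That would make it easy to add this kind of code to a web API.

@p404, sorry for the late reply. I really don't know of another library that would achieve your goal, but keep searching.

Hi Ali, I did some research and found that the PhantomJS headless browser can do it. It can also be loaded from Selenium WebDriver, like this: driver = webdriver.PhantomJS(). I hope you find it useful in the future too.

Thanks for the reply. I'm new to PhantomJS, but I thought you wanted an alternative to Selenium. I'm glad you found a solution.

There is also a way to run a headless browser using ChromeDriver.

That setup looks very interesting, thanks for the update!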
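
On the question of avoiding a browser: the HTML quoted in the question already contains the nested text, and it also contains a stray `</b>`, so it may be worth checking how different parsers handle the raw markup before concluding that a browser is needed. Beautiful Soup's documentation notes that its supported parsers can build different trees from invalid HTML. The sketch below is only a diagnostic aid; `count_child_divs` is a hypothetical helper, and `lxml` and `html5lib` are optional third-party parsers that are skipped if not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_child_divs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each available parser to the number of direct child <div>s in the first row."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        row = soup.find("div", class_="product-tech-section-row")
        counts[parser] = 0 if row is None else len(row.find_all("div", recursive=False))
    return counts
```

Feeding it the raw response from the question's `urlopen` call shows whether any parser sees both child divs. If one does, the missing content is probably a parsing matter rather than a rendering one, and no browser would be required for this page.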
