Python: getting href from HTML using Beautiful Soup select or lxml xpath


python html xpath web-scraping


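I'm doing some web scraping on the Rotten Tomatoes website.

I'm using Python with the Beautiful Soup and lxml modules.

I want to extract movie information, for example:

- Genre: Drama, Musical & Performing Arts
- Directed By: Kirill Serebrennikov
- Written By: Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko
- Written By (links): /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko

I inspected the page HTML to work out the paths: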

                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Rating: </div>
                        <div class="meta-value">NR</div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Genre: </div>
                        <div class="meta-value">

                                <a href="/browse/opening/?genres=9">Drama</a>, 

                                <a href="/browse/opening/?genres=12">Musical &amp; Performing Arts</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Directed By: </div>
                        <div class="meta-value">

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Written By: </div>
                        <div class="meta-value">

                                <a href="/celebrity/michael_idov">Mikhail Idov</a>, 

                                <a href="/celebrity/lily_idova">Lili Idova</a>, 

                                <a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>, 

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>, 

                                <a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">In Theaters: </div>
                        <div class="meta-value">
                            <time datetime="2019-06-06T17:00:00-07:00">Jun 7, 2019</time>
                            <span style="text-transform:capitalize">&nbsp;limited</span>
                        </div>
                    </li>




                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Runtime: </div>
                        <div class="meta-value">
                            <time datetime="P126M">
                                126 minutes
                            </time>
                        </div>
                    </li>


                    <li class="meta-row clearfix">
                    <div class="meta-label subtle">Studio: </div>
                    <div class="meta-value">

                            <a href="http://sonypictures.ru/leto/" target="movie-studio">Gunpowder &amp; Sky</a>

                    </div>

            </li>

For the writers, for example, I only need the text of the element, so it is easy to get:

page_content.select('div.meta-value')[3].getText()

Or, using XPath, for the rating:

tree.xpath('//div[@class="meta-value"]/text()')[0]

For the writer links (which is where I'm having trouble), I access the HTML block like this:

page_content.select('div.meta-value')[3]

Which gives:

<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>, 

                                <a href="/celebrity/lily_idova">Lili Idova</a>, 

                                <a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>, 

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>, 

                                <a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>

The lxml equivalent gives:

<Element div at 0x2915a4c54a8>

I tried:

page_content.select('div.meta-value')[3].get('href')
tree.xpath('//div[@class="meta-value"]')[3].get('href')
tree.xpath('//div[@class="meta-value"]/@href')[3]

All of these return nothing or raise an error. Can anyone help me?

Thanks in advance!

Cheers

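Try the following scripts to get the content you are interested in. Make sure to test both of them against different movies. I expect both to produce the desired output. I tried to avoid any hard-coded index when targeting the content.

Using CSS selectors: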

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.rottentomatoes.com/m/leto')
soup = BeautifulSoup(r.text, 'lxml')

# Pick each row by its label text rather than by position.
# Note: recent soupsieve releases deprecate :contains() in favour of :-soup-contains().
directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
writers = soup.select(".meta-row:contains('Written By') > .meta-value > a")
written = [item.text for item in writers]
written_links = [item.get("href") for item in writers]
print(directed, written, written_links)
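To see why the attempts in the question came back empty, here is a minimal offline sketch. It assumes a trimmed copy of the markup from the question (the `html` string is illustrative and not fetched from the site) and uses the built-in `html.parser`. Instead of the `:contains()` pseudo-class it filters the rows in plain Python, which is a swapped-in technique, not the answer's own. The point it illustrates: `href` lives on the `<a>` children, not on the `div.meta-value` container.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the markup shown in the question.
html = """
<ul>
  <li class="meta-row clearfix">
    <div class="meta-label subtle">Directed By: </div>
    <div class="meta-value">
      <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
    </div>
  </li>
  <li class="meta-row clearfix">
    <div class="meta-label subtle">Written By: </div>
    <div class="meta-value">
      <a href="/celebrity/michael_idov">Mikhail Idov</a>,
      <a href="/celebrity/lily_idova">Lili Idova</a>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the row by its label text instead of a hard-coded index.
row = next(
    li for li in soup.select("li.meta-row")
    if "Written By" in li.select_one(".meta-label").get_text()
)
value_div = row.select_one(".meta-value")

# The div itself carries no href attribute, so .get() returns None.
div_href = value_div.get("href")

# The links sit on the <a> children of that div.
writer_links = [a["href"] for a in value_div.select("a[href]")]
print(div_href, writer_links)
```

On the real page the same idea applies, but the markup may have changed since the question was asked, so the class names here should be treated as assumptions.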

Using XPath:

import requests
from lxml.html import fromstring

r = requests.get('https://www.rottentomatoes.com/m/leto')
root = fromstring(r.text)

# Find the element containing the label text, step up to its parent row,
# then read the links inside the sibling element with class "meta-value".
directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
written_links = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/@href")
print(directed, written, written_links)
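Since the XPath syntax is the part that tends to look opaque, here is a small offline sketch that breaks the expression into steps. It assumes a trimmed copy of the question's markup wrapped in `<html><body>` (illustrative, not fetched), and the `strict_links` variant is my own tighter alternative, not something taken from the answer.

```python
from lxml.html import fromstring

# Trimmed, illustrative copy of the markup shown in the question.
html = """<html><body><ul>
  <li class="meta-row clearfix">
    <div class="meta-label subtle">Directed By: </div>
    <div class="meta-value">
      <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
    </div>
  </li>
  <li class="meta-row clearfix">
    <div class="meta-label subtle">Written By: </div>
    <div class="meta-value">
      <a href="/celebrity/michael_idov">Mikhail Idov</a>,
      <a href="/celebrity/lily_idova">Lili Idova</a>
    </div>
  </li>
</ul></body></html>"""
root = fromstring(html)

# Step 1: every element whose text (including descendants) contains the label.
# This matches the label div and also each of its ancestors.
label = "//*[contains(., 'Written By')]"
matched_tags = [el.tag for el in root.xpath(label)]

# Step 2: from each match go to the parent, then to a child whose class is
# exactly "meta-value". Only the label div has such a sibling, so the
# ancestor matches contribute nothing.
written = root.xpath(label + "/parent::*/*[@class='meta-value']/a/text()")
written_links = root.xpath(label + "/parent::*/*[@class='meta-value']/a/@href")

# Tighter variant (an assumption about what is wanted): anchor on the label
# div itself and move sideways with following-sibling.
strict_links = root.xpath(
    "//div[contains(@class, 'meta-label')][contains(., 'Written By')]"
    "/following-sibling::div[@class='meta-value']/a/@href"
)
print(matched_tags, written, written_links, strict_links)
```

Both forms return the same links on this snippet; the tighter one simply does less matching against ancestor elements.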

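For the cast, I used a list comprehension so that I could call .strip() on the individual elements to get rid of the whitespace. normalize-space() is the ideal option for this, though.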

cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
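As a rough illustration of the difference between the two approaches, the sketch below runs both on a made-up fragment. The markup is hypothetical (only the `cast-item` class and the `span[@title]` shape come from the XPath above, and the names come from the comments), so treat it as an assumption about the page structure. In XPath 1.0, `normalize-space()` applied to a node-set only uses the first node, which is why a per-element call is needed to clean every name.

```python
from lxml.html import fromstring

# Hypothetical cast markup, shaped to match the XPath used in the answer.
html = """<html><body>
<div class="cast-item media">
  <a href="#"><span title="Teo Yoo">
      Teo Yoo
  </span></a>
</div>
<div class="cast-item media">
  <a href="#"><span title="Irina Starshenbaum">
      Irina Starshenbaum
  </span></a>
</div>
</body></html>"""
root = fromstring(html)

spans = "//*[contains(@class,'cast-item')]//a/span[@title]"

# Approach from the answer: take the raw text nodes and strip each in Python.
stripped = [text.strip() for text in root.xpath(spans + "/text()")]

# normalize-space() over the whole node-set only sees the first span.
first_only = root.xpath("normalize-space(" + spans + ")")

# Calling normalize-space() once per element cleans every name.
normalized = [span.xpath("normalize-space()") for span in root.xpath(spans)]
print(stripped, first_only, normalized)
```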

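Take a look at this answer:

Thanks. This returns all the hrefs in the page, but I only need the hrefs in the page_content.select('div.meta-value')[3] part. Any hints? I tried something like this, without success: for a in page_content.select('div.meta-value')[2]: print("Found the URL:", a['href'])

The XPath works perfectly, thank you very much. That said, I don't understand how you arrived at it. Is there a short explanation of the syntax (I don't want to ask too much) or some resource you would suggest checking? Thanks again!

First have a look at this to get a basic idea of how to create relative XPaths.

SIM, if it is not too much to ask, could you help me with one more thing? I really don't understand the syntax. To get the cast list (Teo Yoo, Irina Starshenbaum, etc.) I went through the HTML and the guide you sent me, but I couldn't find anything useful. Could you help me? I really did try. Thanks.

Edited to include the cast, @spcvalente.

You're amazing! I'd gladly buy you a beer. I think that with these examples I'll be able to adapt them and use them for other pages. Thank you very much.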
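On the loop attempted in the comments: iterating directly over a Beautiful Soup tag walks all of its children, including the text nodes between the links, and indexing a text node with `['href']` fails. A minimal sketch, assuming the Genre block from the question as input (parsed with the built-in `html.parser`):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The Genre block from the question, used here as a small offline input.
html = (
    '<div class="meta-value">\n'
    '<a href="/browse/opening/?genres=9">Drama</a>, \n'
    '<a href="/browse/opening/?genres=12">Musical &amp; Performing Arts</a>\n'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.meta-value")

# Iterating over a Tag yields every child, text nodes included.
child_types = [type(child).__name__ for child in div]

# Keep only Tag children that actually carry an href.
links = [
    child["href"]
    for child in div
    if isinstance(child, Tag) and child.has_attr("href")
]

# Usually simpler: ask the div for its anchors directly.
links_via_find_all = [a["href"] for a in div.find_all("a", href=True)]
print(child_types, links, links_via_find_all)
```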
