Getting text from an HTML page with BeautifulSoup and Python


I need to scrape this piece of text, which is nested inside an HTML page.

link: http://warframe.wikia.com/wiki/Frost

text needed: Frost's component blueprints are acquired from Lieutenant Lech Kril & Captain Vor (Exta, Ceres).

I have used bs4 before, but I can't figure out any way to extract this particular text.

This page is not very friendly to web scraping. I wrote a function, get_text(), that takes two arguments, tag_from and tag_to, and collects all the text between those two tags:

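from bs4 import BeautifulSoup, NavigableString
import requests

soup = BeautifulSoup(requests.get('http://warframe.wikia.com/wiki/Frost').text, 'lxml')

def get_text(tag_from, tag_to):
    """Collect the text of every sibling between tag_from and tag_to."""
    parts = []
    s = tag_from.next_sibling
    # Stop at tag_to, or when the siblings run out (s is None),
    # so a missing end tag no longer raises AttributeError.
    while s is not None and s is not tag_to:
        if isinstance(s, NavigableString):
            parts.append(str(s))  # plain text node
        else:
            parts.append(s.text)  # nested tag: take its text content
        s = s.next_sibling
    return ''.join(parts).strip()

# The heading that wraps the #Acquisition anchor, and the first table after it.
heading = soup.select_one('#Acquisition').parent
s = get_text(heading, heading.find_next('table'))
print(s)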

Prints:

Frost's component blueprints are acquired from Lieutenant Lech Kril & Captain Vor (Exta, Ceres).
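
To try the idea without fetching the live page, here is a minimal sketch that runs the same sibling walk on a small made-up fragment. The markup in HTML, the sample sentence, and the helper name text_between are assumptions for illustration, not the real page or part of bs4. It uses the built-in 'html.parser' so lxml is not needed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment, NOT the real page markup: a heading, loose text
# with inline links, then a table that marks where to stop.
HTML = (
    '<div>'
    '<h2><span id="Acquisition">Acquisition</span></h2>'
    "Frost's blueprints drop from <a href='#'>Lech Kril</a> "
    "&amp; <a href='#'>Captain Vor</a>."
    '<table><tr><td>not wanted</td></tr></table>'
    '</div>'
)

def text_between(tag_from, tag_to):
    """Join the text of the siblings after tag_from, stopping at tag_to."""
    parts = []
    for node in tag_from.next_siblings:
        if node is tag_to:
            break
        # Plain text nodes are used as-is; tags contribute their inner text.
        if isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())
    return ''.join(parts).strip()

soup = BeautifulSoup(HTML, 'html.parser')
heading = soup.select_one('#Acquisition').parent
result = text_between(heading, heading.find_next('table'))
print(result)
```

Iterating over next_siblings ends on its own when the siblings run out, so this variant returns whatever it has collected if the end tag is never reached.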

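Edit:

On this page the text is not easy to locate, because no tag encloses it. So my approach is to start at one tag and build up a string from everything I find until I reach the end tag.

Some of that content is of type NavigableString (plain text), and some of it is other tags (I use the .text property to get the string out of those).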
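
The two kinds of node described above can be seen in a short sketch. The one-line markup and the helper name classify are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup: loose text mixed with inline tags.
soup = BeautifulSoup(
    '<p><b>start</b>plain <i>nested</i> tail<u>end</u></p>', 'html.parser'
)

def classify(node):
    """Return (kind, text) for one sibling node."""
    if isinstance(node, NavigableString):
        return ('string', str(node))
    if isinstance(node, Tag):
        return ('tag', node.text)
    return ('other', '')  # defensive fallback

# Walk everything that follows the <b> tag inside the same parent.
kinds = [classify(n) for n in soup.b.next_siblings]
print(kinds)
```

Bare text shows up as 'string' entries and the inline elements as 'tag' entries, which is why the function needs both branches.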

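Could you post what you have tried so far?

Thanks! Works like a charm! Could you add a short description of what you did, so I can learn from it?

@Khristian I added a short description; I hope it helps.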