Python: How to extract the text inside a tag together with its inner tags? [Python, BeautifulSoup]


I want to parse an HTML page with BeautifulSoup. I want to extract the text inside a tag without removing the inner HTML tags. For example, sample input:

<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>

How can I extract the text without removing the inner tags?

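When you write print(soup.text), you extract all the text from the <a> tag, including the text of every tag nested inside it, with the tags themselves stripped. If you want only the <b> tag object, try the following: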

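from bs4 import BeautifulSoup

# string holds the HTML shown in the question
soup = BeautifulSoup(string, 'html.parser')
b = soup.find('b')  # first <b> tag in the document
print(b)
print(type(b))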

Output:

<b>Router link not working</b>
<class 'bs4.element.Tag'>

As you can see, this returns the <b> tag as a BeautifulSoup Tag object.

If you need the data as a string, just write:

b = soup.find('a', class_="fl").find('b')
b = str(b)
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'str'>
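To make the difference between these calls concrete, here is a minimal sketch that runs them side by side on the question's markup. The variable names (html, link, text_only, b_markup, a_markup) are illustrative and not part of the original answer; the markup is written on one line only to keep the output easy to read.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, kept on one line for readability.
html = ('<a class="fl" href="https://stackoverflow.com/questio...">'
        'Angular2 <b>Router link not working</b></a>')

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', class_='fl')

text_only = link.get_text()      # all text, every tag stripped
b_markup = str(link.find('b'))   # only the <b> element, with its tags
a_markup = str(link)             # the whole <a> element, outer tag included

print(text_only)
print(b_markup)
print(a_markup)
```

None of the three is quite what the question asks for: the first loses the <b> tag, the second loses "Angular2", and the third keeps the outer <a> tag.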


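As Den said, you need to get the inner tag and then store it as a str so that the tag itself is included. Den's solution grabs the <b> tag specifically: it does not return the parent tag or its text, and it will not pick up other style tags if there are any. If there are other tags, you can be more general and have it find the children of the <a> tag instead of looking specifically for the <b> tag.

Essentially, this finds the <a> tag and grabs its entire text. It then goes through the children of that <a> tag, converts each one to a string, and replaces the child's text in the parent text with the string that includes the tags.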

from bs4 import BeautifulSoup

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''

soup = BeautifulSoup(string, 'html.parser')

for item in soup.find_all('a'):
    # Start from the plain text of the <a> tag.
    parent = item.text.strip()
    # Swap each child's text for its full markup (text plus tags).
    # str.replace substitutes every occurrence, so this assumes each child's
    # text is unique within the parent and that children are not nested.
    for child_ele in item.find_all():
        parent = parent.replace(child_ele.text, str(child_ele))

    print(parent)

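Output:

Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>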
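Because the approach above relies on text replacement, it can misfire when a child's text also appears elsewhere in the parent. As a hedged alternative (not the answer author's method), the sketch below joins the serialized direct children of the <a> tag instead. The sample markup and the names html, link and inner are made up for illustration; the word "bold" deliberately appears both as plain text and inside a <b> tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup: "bold" occurs once as plain text and once inside <b>.
html = '''<a class="fl" href="https://stackoverflow.com/questio...">
     bold move: <b>bold</b> and <i>italics</i>
     </a>'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# link.contents lists the direct children: text nodes and Tag objects.
# str() on a text node gives its text; str() on a Tag gives its markup.
inner = ''.join(str(node) for node in link.contents).strip()

print(inner)  # bold move: <b>bold</b> and <i>italics</i>
```

With the replacement loop, the plain-text "bold" at the start would also have been wrapped in <b> tags; joining the children leaves it untouched.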


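Going by the first answer, the following works well. For this example:

from bs4 import BeautifulSoup

# Triple quotes are required for a string literal that spans several lines.
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
             Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
# encode_contents() returns the inner markup as bytes; strip() drops the
# leading and trailing whitespace that surrounds it in the source.
inner = soup.find('a').encode_contents().decode('utf-8').strip()
print(inner)

It gives:

Angular2 <b>Router link not working</b>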
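For pages with several matching links, the same idea can be wrapped in a small helper. This is a sketch under assumptions: inner_html is a hypothetical name, not a BeautifulSoup function, and it uses Tag.decode_contents(), which returns the inner markup as a str so that no separate decode step is needed.

```python
from bs4 import BeautifulSoup


def inner_html(tag):
    """Return the markup inside `tag`, without the tag itself (hypothetical helper)."""
    # decode_contents() renders the tag's children as a str.
    return tag.decode_contents().strip()


html = '''<a class="fl" href="https://stackoverflow.com/questio...">
             Angular2 <b>Router link not working</b>
         </a>'''

soup = BeautifulSoup(html, 'html.parser')

# Collect the inner markup of every <a class="fl"> on the page.
results = [inner_html(a) for a in soup.find_all('a', class_='fl')]
print(results)  # ['Angular2 <b>Router link not working</b>']
```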


Have you tried not passing a parser to the BeautifulSoup constructor and then casting to a string? This has already been answered here:

@helenej Thanks for your reply. I tried it, but it didn't work. It gives a ... again.

This answer returns only the inner <b> tag and drops the first part of the text, Angular2, in this example. I want to keep the whole text along with its inner tags.

Nice job @hamid. I tried using .encode_contents(), but it also returned the outer <a> tag. I see that you have to specify .find('a') to get the desired result. Thanks for posting the solution to your own question, it is very helpful!