Python Beautiful Soup - get all text, but preserve link HTML?


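I have to process a large archive of extremely messy HTML, full of extraneous tables, spans and inline styles, and convert it to Markdown.

I am trying to use Beautiful Soup to accomplish this, and my goal is basically the output of the get_text() function, except that anchor tags are preserved with their href intact.

For example, I would like to convert: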

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>
My attempts so far return multiple fragments/duplicates as the parser moves down the tree:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>
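
The code that produced this output is not shown in the thread. As a hedged illustration only (not the asker's actual code), the sketch below shows one way this kind of duplication can arise: calling get_text() on every tag returned by find_all() repeats the inner text once per enclosing tag. The shortened HTML and the use of the "html.parser" builder are assumptions made so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (assumption: trimmed for brevity).
html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span></td>'
)

# "html.parser" is the stdlib-backed builder; the thread itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, so each ancestor repeats the text
# of all the tags nested inside it.
fragments = [tag.get_text() for tag in soup.find_all(True)]

for fragment in fragments:
    print(repr(fragment))
```

Here "Baz" shows up three times: once for the td, once for the outer span, and once for the inner span. With the "lxml" builder the fragment is additionally wrapped in html and body tags, which would be consistent with the three identical full-text lines at the top of the output above.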

To consider only direct children, set recursive=False. You then need to process each 'td' and extract the text and the anchor link separately.
#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))
# Print each span's text on its own line, then the link found inside the same tag.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))
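
A caveat when applying this to the markup from the question: find_all('span') is recursive by default, so a nested span is returned in addition to its parent and its text is printed twice. The sketch below contrasts the default with recursive=False on the td; the trimmed HTML and the "html.parser" builder are assumptions chosen for a dependency-free example.

```python
from bs4 import BeautifulSoup

# Trimmed markup with one nested span (assumption: shortened from the question).
html = (
    '<td><span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com">Google</a></span></td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# Default (recursive) search: the inner span is matched as well.
all_spans = [span.get_text() for span in td.find_all("span")]

# recursive=False: only spans that are direct children of the td.
direct_spans = [span.get_text() for span in td.find_all("span", recursive=False)]

print(all_spans)
print(direct_spans)
```

The recursive search yields "Baz" as a separate extra entry, while the direct-children search returns each top-level span once.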

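One possible way to tackle this is to introduce some special handling for <a> elements when the text of an element is produced.

You can do that by overriding the _all_strings() method so that it returns the string representation of each <a> descendant and skips the navigable strings inside <a> elements. Something along these lines: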

from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:

In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """

In [2]: soup = MyBeautifulSoup(data, "lxml")

In [3]: print(soup.get_text())

HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>

Do you also want to remove the style and the other link attributes? Your input and output suggest so.

Thank you so much, this is a flexible, elegant solution that I never would have thought of. I made one minor tweak to the handling of <a> tags to get the output I needed, and it works perfectly.

How can it return only the href attribute? For example:
Example Link:

Answering my own question, just change:

if isinstance(descendant, Tag) and descendant.name == 'a': yield str(descendant)

to:

if isinstance(descendant, Tag) and descendant.name == 'a': yield str('…'.format(descendant.get('href', '')))
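
For the follow-up about dropping style and the other attributes, here is a minimal sketch of an alternative that does not override the underscore-prefixed _all_strings() method. It uses only find_all(), replace_with() and get_text(): each <a> tag is swapped for a plain string holding its markup, so get_text() then returns that markup as ordinary text. The function name text_with_links, the href_only flag, the rebuilt `<a href="...">` output format and the "html.parser" builder are all assumptions made for illustration; they are not taken from the thread.

```python
from bs4 import BeautifulSoup


def text_with_links(html, href_only=False):
    """Return the text of `html`, keeping <a> tags as literal markup.

    Hypothetical helper: with href_only=True every attribute except
    href is dropped from the emitted anchor.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so replacing tags while looping is safe.
    for a in soup.find_all("a"):
        if href_only:
            # Note: href and link text are inserted without re-escaping.
            rendered = '<a href="{}">{}</a>'.format(a.get("href", ""), a.get_text())
        else:
            rendered = str(a)
        # A plain str passed to replace_with() becomes a text node.
        a.replace_with(rendered)
    return soup.get_text()


example = (
    '<td><span>Example Link: <a href="https://google.com" target="_blank" '
    'style="color: #395c99;">Google</a></span></td>'
)

print(text_with_links(example, href_only=True))
print(text_with_links(example))
```

Because the anchors become text nodes before get_text() runs, the usual separator and strip arguments of get_text() could still be applied; the trade-off is that the parsed tree is modified, so the function parses its own copy of the input.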