Replacing tags with plain text using a Python module

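I am using Beautiful Soup to extract "content" from web pages. I know this has been asked before, and everyone was pointed to Beautiful Soup, which is how I got started with it.

I was able to get most of the content successfully, but I am running into trouble with tags that are part of the content. I started with a basic strategy: if a node contains more than x characters, it is content. Take the following HTML as an example: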

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda(x): len(x) > 20)

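The above does not work: Beautiful Soup inserts the string as a NavigableString, and that causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to reduce the HTML to plain text first, clear out all the unwanted tags, and then call Beautiful Soup. But I would like to avoid processing the same content twice. I am parsing these pages so that I can show a snippet of content for a given link, much like Facebook Share, and I presume it will be faster if everything is done with Beautiful Soup.

So my question is: is there a way to "clear tags" and replace them with "plain text" using BeautifulSoup? If not, what is the best way to do it?

Thanks for your suggestions!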
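To make the goal concrete, here is a minimal sketch of "treat inline tags as plain text, then keep blocks longer than 20 characters". It deliberately does not use Beautiful Soup; it swaps in the standard library's `html.parser` so that it runs on its own under Python 3. The names `INLINE_TAGS`, `BlockTextCollector` and `extract_blocks` are hypothetical, and the set of tags treated as inline is an assumption, not something taken from the question.

```python
from html.parser import HTMLParser

# Assumption: tags in this set are "flattened", i.e. their text is treated
# as part of the surrounding block. Every other tag ends the current block.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


class BlockTextCollector(HTMLParser):
    """Collect the text of each block, treating inline tags as transparent."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks
        self._buffer = []   # text pieces of the block currently being read

    def flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def extract_blocks(html, min_length=20):
    """Return the text blocks of `html` that are longer than `min_length`."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text that follows the last tag
    return [block for block in parser.blocks if len(block) > min_length]


html = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it
    will get picked up by the parser as content
</div>
'''
print(extract_blocks(html))
```

On the example from the question this yields a single block that contains the link text inline. Real pages would need more care, for instance skipping the contents of script and style elements, which this sketch would collect as ordinary text.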

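Update: Alex's code works very well for the example. I also tried various edge cases, and they all worked fine with the modifications below. So I gave it a shot on a real-life website, and I ran into issues that puzzle me.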

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)  # parse the fetched page; `soup` is used below

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the code above, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

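When I look at the HTML, "Stay up to date.." does not have any previous sibling. I did not know how previousSibling worked until I saw Alex's code, and based on my testing it looks for "text" before the tag. So, if there is no previous sibling, I am surprised that it does not go through the if logic that checks whether a.previousSibling is None and a.nextSibling is None.

Could you please tell me what I am doing wrong?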

-ecognium

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

anchors = soup.findAll('a')
for a in anchors:
  a.previousSibling.replaceWith(a.previousSibling + a.string)

results = soup.findAll(text=lambda(x): len(x) > 20)

print results

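Of course, you will probably need to be a bit more careful: if there is no a.string, or if a.previousSibling is None, you will need suitable if statements to handle those cases. But I hope this general idea helps. In fact, you may also want to merge the next sibling if it is a string. I am not sure how that plays with your len(x) > 20 heuristic, but say you have two 9-character strings with a tag containing a 5-character string in the middle: perhaps you would want to pick up the lot as one 23-character string? I cannot tell, because I do not understand the motivation for your heuristic.

I imagine that, besides the a tag, you will also want to remove other kinds of tags. I guess this, too, depends on the actual idea behind your heuristic.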
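The guards described above can be sketched on a toy model rather than on Beautiful Soup objects. Here a node's children are a plain Python list whose items are either strings or instances of a hypothetical `Tag` class; `Tag`, `flatten` and the `names` parameter are illustrative assumptions, not part of any library. The sketch replaces the selected tags with their text, merges with the previous and next siblings only when they are strings, and never concatenates with a sibling that is itself a tag (the situation behind the TypeError in the question).

```python
class Tag:
    """Hypothetical stand-in for a parsed tag: a name plus an optional string."""

    def __init__(self, name, string=None):
        self.name = name
        self.string = string


def flatten(children, names):
    """Return a new child list in which tags whose name is in `names` are
    replaced by their text and runs of adjacent strings are merged."""
    merged = []
    for child in children:
        if isinstance(child, Tag):
            if child.name not in names:
                merged.append(child)    # a tag we keep: never concatenate with it
                continue
            child = child.string or ""  # a missing string counts as empty text
        if merged and isinstance(merged[-1], str):
            merged[-1] += child         # previous sibling is a string: merge
        elif child:
            merged.append(child)        # no previous sibling, or it is a tag
    return merged


# Two 9-character strings around a 5-character link become one 23-character string.
print(flatten(["123456789", Tag("a", "abcde"), "987654321"], {"a"}))
# prints ['123456789abcde987654321']
```

Because the next sibling is merged when the loop reaches it, no separate branch is needed for each combination of missing previous and next siblings.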

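I ran into this when trying to flatten tags in a document, so that the entire content of a tag is pulled up into its parent node in place. I wanted to reduce the content of a p tag together with all its sub-paragraphs, lists, divs, spans and so on, while getting rid of the style and font tags and some horrible Word-to-HTML generator remnants. I found this rather complicated to do with BeautifulSoup itself, because extract() also removes the content and replaceWith() unfortunately does not accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following method: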

import re
def flatten_tags(s, tags):
   pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>"%(isinstance(tags, basestring) and tags or "|".join(tags)))
   return pattern.sub("", s)

The tags argument is either a single tag name or a list of tag names to be flattened.
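The function above relies on `basestring`, which only exists in Python 2. For readers on Python 3, here is a simplified variant of the same idea. It is a sketch, not the answer author's exact pattern: it adds a word boundary after the tag name so that flattening `a` leaves a tag such as `abbr` alone, and it assumes attribute values never contain `<` or `>`.

```python
import re


def flatten_tags(s, tags):
    """Remove the opening and closing forms of `tags` from `s`, keeping
    whatever text was between them."""
    if isinstance(tags, str):
        tags = [tags]
    names = "|".join(re.escape(tag) for tag in tags)
    # "<", an optional "/", the tag name as a whole word, then anything up
    # to the closing ">" (attributes, or a trailing "/" in <br/>-style tags).
    pattern = re.compile(r"<\s*/?\s*(?:%s)\b[^<>]*>" % names, re.IGNORECASE)
    return pattern.sub("", s)


print(flatten_tags('some long text goes <a href="/"> here </a> and', "a"))
print(flatten_tags('<b>x</b><font size="2">y</font><abbr>z</abbr>', ["b", "font"]))
```

Like any regular-expression approach to HTML, this works on the raw string and has no idea of nesting or comments, so it is best kept to the kind of cleanup pass described above.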

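Thanks a lot, Alex. Your code works very well for many combinations of the example I posted. However, when I run it on a real website, I get strange results. I don't know what I am doing wrong! I just updated the post with my new code. Thank you very much for your help. You are right, I want to merge all the text into one giant string. I am basically trying to get the "content" part of a page so that I can display a summary. You are also right that I will eventually have to deal with all the other tags as well.

@Ecognium, the specific problem you are having is that the previous or next sibling does exist, but it is a Tag rather than a string. In that case you cannot concatenate it with strings, so you should basically skip it (i.e., do not modify it!). To deal with multiple tags, just make sure you iterate over them, for example with a selector function that returns True for all the tags you want to remove, and only for those.

@Alex, thanks again. That makes sense. I added some isinstance checks to ignore the case where the previous sibling is a tag, but even that caused problems. I will do more debugging and try to figure out the issue. Thank you very much for your time.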

Output of the first answer's script on the example HTML:

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']
