What is the best way to strip everything but the text from a web page in Python?

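I'm looking to take an HTML page and extract only the plain text on it. Does anyone know a good way to do this in Python?

I want to literally strip out everything and be left with just the text of the article and whatever other text sits between tags. JS, CSS, and so on should all be gone.

Thanks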

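According to one article:

import re

def remove_html_tags(data):
    # Remove anything that looks like a tag; text between tags is kept.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, "the re module needs to be imported to use regular expressions."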

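The first answer here will not remove the bodies of CSS or JavaScript tags that are embedded in the page (as opposed to linked). This might get closer: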

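import re

def stripTags(text):
    # DOTALL lets '.' span newlines; IGNORECASE also catches <SCRIPT> and <STYLE>.
    flags = re.DOTALL | re.IGNORECASE
    scripts = re.compile(r'<script.*?/script>', flags)
    css = re.compile(r'<style.*?/style>', flags)
    tags = re.compile(r'<.*?>', flags)

    # Remove script and style blocks with their bodies first, then any remaining tags.
    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)

    return text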
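One thing worth checking with the regex approach: by default `.` does not match a newline, so a pattern like `<script.*?/script>` only removes a multi-line script body when it is compiled with `re.DOTALL`. A small sketch of the difference follows; the sample `page` string and the two pattern names are made up for illustration.

```python
import re

# Same pattern, compiled without and with the DOTALL flag.
SCRIPT_ONE_LINE = re.compile(r'<script.*?/script>')
SCRIPT_DOTALL = re.compile(r'<script.*?/script>', re.DOTALL | re.IGNORECASE)

# Hypothetical page whose script body spans several lines.
page = "<p>Hi</p><script>\nvar a = 1;\n</script><p>Bye</p>"

# Without DOTALL the pattern cannot cross the newlines, so nothing is removed.
without_flag = SCRIPT_ONE_LINE.sub('', page)

# With DOTALL the whole script element is removed.
with_flag = SCRIPT_DOTALL.sub('', page)

print(without_flag == page)
print(with_flag)
```

Even with the flag, this is still pattern matching rather than parsing, so unusual markup can slip through.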


The lxml module is worth considering. However, removing CSS and JavaScript takes a bit of extra processing:

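from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop <style>, <script>, and comment nodes. drop_tree() keeps the text
    # that follows each node (its tail), which getparent().remove() would discard.
    for item in source.xpath("//style|//script|//comment()"):
        item.drop_tree()

    # Yield every non-blank text fragment in document order.
    for line in source.itertext():
        if line.strip():
            yield line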

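The yielded lines can simply be joined, but that can lose significant word boundaries if there is no whitespace around tags that would normally produce it.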

Depending on your requirements, you may also want to iterate over only one particular tag.
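On the point about joining the yielded lines: one possible way to keep word boundaries is to join the fragments with a single space and collapse the whitespace inside each one. This is only a sketch; `join_fragments` is a hypothetical helper, and the sample fragments are made up rather than taken from a real page.

```python
def join_fragments(fragments):
    """Join text fragments with single spaces, collapsing runs of whitespace."""
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return " ".join(piece for piece in cleaned if piece)

# Made-up fragments, as might come from "<p>First</p><p>Second\n  line</p>".
fragments = ["First", "Second\n  line", "   "]

joined = join_fragments(fragments)   # words from separate blocks stay apart
naive = "".join(fragments)           # "First" and "Second" fuse together

print(joined)
print(repr(naive))
```

The trade-off is that a space is also inserted where an inline tag splits a single word, so this suits block-level text better than heavily marked-up inline text.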

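You could try the rather excellent Beautiful Soup.

But be warned: whatever you get back from any parsing attempt is subject to "errors": bad HTML, bad parsing, and generally unexpected output. If your source documents are well known and well formed, you should be fine, or at least able to work around their idiosyncrasies. If it is just general material "found on the internet", expect all kinds of weird and wonderful outliers.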
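To get a feel for how a lenient parser copes with sloppy markup, here is a minimal sketch using the standard library's `html.parser` instead of Beautiful Soup. This is a swapped-in technique, not what the answer above recommends. The names `TextExtractor` and `extract_text` are hypothetical, and the sample markup (with an unclosed `<b>`) is made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped automatically.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return " ".join(parser.parts)

sample = ("<p>Hello <b>world</p><script>var x = '<p>no</p>';</script>"
          "<style>p{}</style><!-- c -->done")
print(extract_text(sample))
```

It does not repair the document tree the way a full parser does, but for plain text extraction that often does not matter.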

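I would also recommend BeautifulSoup, but I'd suggest using something like the answer to a related question, which I'll copy here for those who don't want to go and look: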

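# bs4 (beautifulsoup4) is the maintained successor to the BeautifulSoup 3 module.
from bs4 import BeautifulSoup, Comment

# html is assumed to hold the page source as a string.
soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)

def visible(element):
    # Skip text that lives in containers the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif isinstance(element, Comment):
        return False
    return True

visible_texts = list(filter(visible, texts))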


For example, I tried it on this very page and it worked quite well.

This is the cleanest, simplest solution I have found for stripping CSS and JavaScript:

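from bs4 import BeautifulSoup

# Keep every text node whose parent is neither <script> nor <style>.
''.join(s for s in BeautifulSoup(content, 'html.parser').find_all(string=True)
        if s.parent.name not in ('script', 'style'))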

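Comments:

[…] will get you this.

Yeah, normally I'm against parsing HTML with regular expressions, but this seems like a simple enough approach. Of course, it will also strip out code samples, if there are any. Just a thought :)

Hmm, that doesn't get rid of the JavaScript, just the tags. The same goes for inline CSS definitions.

This won't strip CSS, JavaScript, or embedded content such as what appears on yahoo.com.

I tried using Beautiful Soup, but a high percentage of the time it breaks on bad HTML, which is no bueno.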
