How to isolate the main text of a web page in Python?

python web-scraping

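I have to save the text of a web page and use another function to summarize it. The problem is that my summary ends up containing strange text from various parts of the page, such as ads. I'm using BeautifulSoup to extract the text. Here is the text-extraction code: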

import urllib.request

from bs4 import BeautifulSoup


def web_crawler():
    userinput = input("Enter a valid Web Page URL: ")
    html = urllib.request.urlopen(userinput).read()
    # TODO: handle urllib.error.URLError when no internet connection is available
    soup = BeautifulSoup(html.decode("utf8"), "html.parser")
    for tag in soup(["script", "style", "a"]):   # remove JavaScript, CSS and links
        tag.extract()
    # read the title text directly from the <title> tag, if the page has one
    title = soup.title.get_text(strip=True) if soup.title else ""
    htmlText = " ".join(soup.get_text().split())   # collapse unnecessary whitespace
    # save a text file to use in the memory-friendly version
    with open("textFile.txt", mode="w", encoding="utf8") as textFile:
        textFile.write(htmlText)
    # for now return the title query and the article text
    return (title, htmlText)

For example, I want to summarize the text content of this web page:

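When I summarize the text, I get text from the ads and from the features on the side of the page. Is there a way to grab only the main body text from the web page?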

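Have you tried figuring out which elements contain the text you want?

If you want to do this for any web page, it will be very difficult, because they are all different. If it is for pages on a specific site, for example articles on a particular news site, you can look at the page source and try to find the class or id of the element that contains the text.

@IgnacioVazquez Abrams Oh, I see. Most of the text I need will be inside a particular tag. bs4 should have a function for extracting text from a specific tag, right?

I was hoping for something more like "the div whose id is xxx", but I suppose that could work too…

@DanielGibbs I will try using the findAll function on that tag for any website. That may exclude the unwanted text. If it doesn't, I will make it site-specific and use the div id.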
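
The two ideas from the comments could look roughly like this. It is only a sketch under assumptions: the sample HTML, the `article-body` id, the choice of `<p>` as the tag to collect, and the helper names `paragraph_text` and `container_text` are all made up for illustration. A real site will use its own ids and classes, which you have to look up in the page source.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration; the ids and classes are assumptions.
SAMPLE_HTML = """
<html><head><title>Example</title></head>
<body>
  <div id="sidebar"><p>Buy now!</p></div>
  <div id="article-body">
    <p>First paragraph of the article.</p>
    <p>Second   paragraph.</p>
  </div>
  <div class="ad">Sponsored text</div>
</body></html>
"""


def paragraph_text(html):
    """Generic approach: join the text of every <p> tag on the page."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    # Collapse runs of whitespace inside each paragraph and skip empty ones.
    return " ".join(" ".join(text.split()) for text in paragraphs if text)


def container_text(html, element_id):
    """Site-specific approach: keep only the text inside one element, chosen by id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=element_id)
    if container is None:   # the id does not exist on this page
        return ""
    return " ".join(container.get_text(" ", strip=True).split())


generic = paragraph_text(SAMPLE_HTML)
specific = container_text(SAMPLE_HTML, "article-body")
```

On this sample, the tag-based version drops the ad `div` but still picks up the paragraph in the sidebar, while the id-based version returns only the article text. That is the trade-off described above: collecting by tag works on any page but is noisier, and selecting by id is cleaner but tied to one site's markup.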