Python 3.x: How do I extract the body text of this article?


How can I get only the text that belongs to the article? I don't want the unrelated text from the rest of the page.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

test1 = 'https://www.sfchronicle.com/news/bayarea/heatherknight/article/Special-education-teacher-a-prime-example-of-13560483.php'

# Opening up the connection, grabbing the page
uClient = uReq(test1)
page_html = uClient.read()
uClient.close()

# HTML parsing
page_soup = soup(page_html, "html.parser")
#print(page_soup.prettify())

# text of article
text = page_soup.find_all('p')
print(text)

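What you need to do is loop over the result of page_soup.find_all('p'):

for p in page_soup.find_all('p'):
    print(p.text, p.next_sibling)

You can try print(''.join([t.text for t in text])).

Is there a way to remove the irrelevant 'p' elements that works for most articles? If I don't want the last 'p', how could I do something like (p.next_sibling-1)?

If you don't want the last 'p', you can change the loop to 'for p in page_soup.find_all('p')[:-1]:'.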
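The suggestions in this thread (joining the paragraph text, slicing off the last 'p') can be combined into one small helper. The following is only a sketch under stated assumptions: SAMPLE_HTML is a made-up stand-in for the downloaded page, extract_article_text and its drop_last parameter are hypothetical names, and the idea of restricting the search to an <article> element is a heuristic that may not match the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; the real page's markup may differ.
SAMPLE_HTML = """
<html><body>
  <nav><p>Subscribe</p></nav>
  <article>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
    <p>Follow the author on social media.</p>
  </article>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""


def extract_article_text(html, drop_last=0):
    """Return the text of <p> tags, one per line, skipping the last `drop_last`."""
    page_soup = BeautifulSoup(html, "html.parser")

    # Assumption: the story sits inside an <article> element.
    # If there is none, fall back to searching the whole page.
    container = page_soup.find("article")
    if container is None:
        container = page_soup

    paragraphs = container.find_all("p")
    if drop_last > 0:
        # Same idea as find_all('p')[:-1] from the thread, generalised to n.
        paragraphs = paragraphs[:-drop_last]

    # get_text(strip=True) trims surrounding whitespace from each paragraph.
    return "\n".join(p.get_text(strip=True) for p in paragraphs)


article_text = extract_article_text(SAMPLE_HTML, drop_last=1)
print(article_text)
```

With a real page, you would pass page_html from the question instead of SAMPLE_HTML, and probably swap the <article> lookup for whatever container the site actually uses, which you can find by inspecting the page source.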