Python 3.x: How do I extract text from a website using Jupyter?


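I am trying to get the text of an article from a link, but when I import the text I also get all the other links, ad links, and image names, which I don't need for my analysis.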

import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml").get_text()
raw

I got this result (I copied only a few lines; the actual article text is in there too, but on other lines):

window.performance&&window.performance.mark&&window.performance.mark(\'PageStart\');Best Bites: Weeknight meals cauliflower vegetable fried rice!function(s,f,p){var a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}},l=function(){};l.prototype=e,l=new l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms Webkit",u=e._config

I would just like to know whether there is any way to extract only the article text and ignore all of these values.

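When BS4 parses a site, it builds its own DOM internally as an object.

To access different parts of that DOM, we have to use the correct accessors or tags, as shown below.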

import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag 
print(readableText)
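
A caveat with the snippet above: `parsedHTML.article` is `None` when a page has no `<article>` tag, and any `<script>` or `<style>` elements nested inside the article can still leak into the text. The following is a minimal sketch of one way to guard against both, not a definitive solution. `SAMPLE_HTML` and `extract_article_text` are hypothetical names introduced here for illustration, and the inline sample page stands in for the downloaded HTML so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the downloaded HTML.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <script>window.performance && window.performance.mark('PageStart');</script>
    <style>body { margin: 0; }</style>
  </head>
  <body>
    <nav><a href="/ads">Ad link</a></nav>
    <article>
      <h1>Cauliflower fried rice</h1>
      <p>Chop the cauliflower.</p>
      <script>var tracker = 1;</script>
      <p>Fry it with vegetables.</p>
    </article>
  </body>
</html>
"""


def extract_article_text(html):
    """Return readable text, preferring the <article> element when present."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are code or styling, not prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # soup.article is None when the page has no <article> tag,
    # so fall back to <body>, then to the whole document.
    root = soup.article or soup.body or soup

    # Join the text fragments with newlines and strip surrounding whitespace.
    return root.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(extract_article_text(SAMPLE_HTML))
```

On a real page, you would pass the `html` string returned by `request.urlopen(url).read().decode('utf8')` instead of `SAMPLE_HTML`. Whether the site actually wraps its article in an `<article>` tag has to be checked per site.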

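Great solution! Thanks for the help, Definity! Do you know of another approach if the link doesn't end in .html?

It depends on what you mean. As far as I know, BS4 can parse XML/HTML, so the page has to be HTML or XML for you to traverse the DOM by tag. But if you mean online pages, they don't have to end in .html; if you can read a page in a browser, 99% of the time it will be HTML.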
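
Since the URL suffix says little about what a server returns, one option is to check the `Content-Type` response header before handing the body to BS4. The sketch below illustrates that idea under stated assumptions: `MARKUP_TYPES`, `is_markup`, and `fetch_if_markup` are hypothetical names introduced here, and the set of accepted media types is an assumption that may need adjusting for a given site.

```python
from urllib import request

# Assumed set of media types worth passing to an HTML/XML parser.
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "application/xml", "text/xml"}


def is_markup(content_type):
    """Return True when a Content-Type header value names HTML or XML."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in MARKUP_TYPES


def fetch_if_markup(url):
    """Download url and return its decoded body, or None if it is not HTML/XML."""
    with request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        if not is_markup(content_type):
            return None
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by `fetch_if_markup(url)` can then be given to `BeautifulSoup(html, "html.parser")` as in the answer; a `None` result means the resource was something else, such as a PDF or an image.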