Python: removing an extraneous div tag with BeautifulSoup
I am trying to scrape text from a website but cannot work out how to remove an extraneous div tag. The code looks like this:
import requests
from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    print(lyrics)
This gives me output like:

<div class="dn" id="content_h">The club isn't the best place...
I want to remove the div tag.

You can use a regular expression:
import requests
import re
from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', lyrics.text)
    print(cleantext)
This removes the < and > together with everything between them, using the special characters described in the Python docs:

"
.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

*?
The '*' qualifier is greedy; adding ? after it makes it perform the match in non-greedy or minimal fashion, so as few characters as possible are matched.
"
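A minimal sketch of how the pattern behaves, run on a made-up HTML snippet (the sample string is for illustration only):

```python
import re

# '<.*?>' matches a '<', then as few characters as possible, then '>':
# the non-greedy '*?' stops at the first '>', so each tag is removed
# individually instead of one match swallowing everything between the
# first '<' and the last '>'.
cleanr = re.compile('<.*?>')

sample = '<div class="dn" id="content_h">The club<br>is not the best place</div>'
print(re.sub(cleanr, '', sample))  # The clubis not the best place
```

With a greedy '<.*>' the single match would span from the first '<' to the final '>', deleting the text in between as well.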
The complete code:
import requests
from bs4 import BeautifulSoup

urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in urls:
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.select_one('#content_h')
    for e in div.find_all('br'):
        e.replace_with('\n')
    lyrics = div.text
    print(lyrics)
Note that sometimes the wrong encoding is used and a line such as

I may be crazy, don't mind me

comes back with mis-encoded characters. That is why I set it manually with page.encoding = 'utf-8'. The snippet of the requests docs that mentions this case:
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this attribute.
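The garbling happens when UTF-8 bytes from the server are decoded with a different codec that the headers (wrongly) suggested. A stdlib-only sketch of the effect, using an invented lyric fragment and cp1252 as the wrong guess:

```python
# The server sends UTF-8 bytes; decoding them with the wrong codec
# produces mojibake like the line quoted above.
raw = 'Café, don\u2019t mind me'.encode('utf-8')

print(raw.decode('cp1252'))  # CafÃ©, donâ€™t mind me  -- garbled
print(raw.decode('utf-8'))   # Café, don’t mind me     -- correct
```

Setting page.encoding = 'utf-8' before touching page.text forces requests to use the correct codec for the decode.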
You can just take the text from the div you found, e.g. str = lyrics.text — you don't need to use a regular expression at all. lyrics.text is enough :P There is nothing left in that string for the pattern to match, so the re.sub call is unnecessary.

Thanks, that's exactly what I wanted it to do. One question though: why does div = soup.select_one('#content_h') do the same thing I delegated to lyrics = soup.find(class_='dn')?

That's because content_h is the id of the element you are targeting. An id is a unique identifier, so if you want to target one specific element it is usually best to use it. #content_h is a CSS selector, which you can read more about. Awesome, thanks, this helped a lot.
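To make the id-versus-class point concrete, a small self-contained sketch (the HTML snippet is invented):

```python
from bs4 import BeautifulSoup

html = ('<div class="dn" id="content_h">The real lyrics</div>'
        '<div class="dn">Another block with the same class</div>')
soup = BeautifulSoup(html, 'html.parser')

# find(class_='dn') returns the FIRST element with that class; if the
# page layout changes, the first 'dn' div may no longer be the lyrics.
by_class = soup.find(class_='dn')

# '#content_h' is a CSS id selector; ids are unique within a document,
# so this always targets exactly one element.
by_id = soup.select_one('#content_h')

print(by_class.text)  # The real lyrics
print(by_id.text)     # The real lyrics
```

Here both lookups happen to hit the same element, but the id lookup stays correct even if another 'dn' div is inserted before the lyrics.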