Python: removing an extraneous div tag with BeautifulSoup
I am trying to scrape text from a website but cannot work out how to remove an extraneous div tag. The code looks like this:
import requests
from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    print(lyrics)
This gives me output like:

<div class="dn" id="content_h">The club isn't the best place...
I want to remove the div tag.

You can use a regular expression:
import requests
import re
from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', lyrics.text)
    print(cleantext)
This removes the < and > together with everything between them, using the special characters described in the Python docs:

"
.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

*?
The '*' qualifier is greedy; adding ? after it makes it perform the match in non-greedy or minimal fashion, so as few characters as possible are matched.
"
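A minimal sketch of how the pattern behaves, run on a made-up HTML snippet (the sample string is for illustration only):

```python
import re

# '<.*?>' matches a '<', then as few characters as possible, then '>':
# the non-greedy '*?' stops at the first '>', so each tag is removed
# individually instead of one match swallowing everything between the
# first '<' and the last '>'.
cleanr = re.compile('<.*?>')

sample = '<div class="dn" id="content_h">The club<br>is not the best place</div>'
print(re.sub(cleanr, '', sample))  # The clubis not the best place
```

With a greedy '<.*>' the single match would span from the first '<' to the final '>', deleting the text in between as well.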
The complete code:
import requests
from bs4 import BeautifulSoup

urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in urls:
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.select_one('#content_h')
    for e in div.find_all('br'):
        e.replace_with('\n')
    lyrics = div.text
    print(lyrics)
Note that sometimes the wrong encoding is used and a line such as

I may be crazy, don't mind me

comes back with mis-encoded characters. That is why I set it manually with page.encoding = 'utf-8'. The snippet of the requests docs that mentions this case:
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this attribute.
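The garbling happens when UTF-8 bytes from the server are decoded with a different codec that the headers (wrongly) suggested. A stdlib-only sketch of the effect, using an invented lyric fragment and cp1252 as the wrong guess:

```python
# The server sends UTF-8 bytes; decoding them with the wrong codec
# produces mojibake like the line quoted above.
raw = 'Café, don\u2019t mind me'.encode('utf-8')

print(raw.decode('cp1252'))  # CafÃ©, donâ€™t mind me  -- garbled
print(raw.decode('utf-8'))   # Café, don’t mind me     -- correct
```

Setting page.encoding = 'utf-8' before touching page.text forces requests to use the correct codec for the decode.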
You can just take the text from the div you found, e.g. str = lyrics.text — you don't need to use a regular expression at all. lyrics.text is enough :P There is nothing left in that string for the pattern to match, so the re.sub call is unnecessary.

Thanks, that's exactly what I wanted it to do. One question though: why does div = soup.select_one('#content_h') do the same thing I delegated to lyrics = soup.find(class_='dn')?

That's because content_h is the id of the element you are targeting. An id is a unique identifier, so if you want to target one specific element it is usually best to use it. #content_h is a CSS selector, which you can read more about. Awesome, thanks, this helped a lot.
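To make the id-versus-class point concrete, a small self-contained sketch (the HTML snippet is invented):

```python
from bs4 import BeautifulSoup

html = ('<div class="dn" id="content_h">The real lyrics</div>'
        '<div class="dn">Another block with the same class</div>')
soup = BeautifulSoup(html, 'html.parser')

# find(class_='dn') returns the FIRST element with that class; if the
# page layout changes, the first 'dn' div may no longer be the lyrics.
by_class = soup.find(class_='dn')

# '#content_h' is a CSS id selector; ids are unique within a document,
# so this always targets exactly one element.
by_id = soup.select_one('#content_h')

print(by_class.text)  # The real lyrics
print(by_id.text)     # The real lyrics
```

Here both lookups happen to hit the same element, but the id lookup stays correct even if another 'dn' div is inserted before the lyrics.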