When web scraping in Python, how can I use BeautifulSoup to write a single function that extracts posts from different blogs?


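The HTML page of one of the first blog posts:


We are under the same sky.

You and I.

I share the soul of earth with you,

to contribute a verse too.

I have words to give,

a smile to offer.

You are at your right place.

You live ,you stay ,you move ,you play.

May also have works to do and words to say.

We may cross each other or not.

But the thing is, we are here,

in this instant;So what, not so clear.

But the powerful play goes on,

for you may contribute a verse.

Advertisements Occasionally, some of your visitors may see an advertisement here,
as well as a Privacy & Cookies banner at the bottom of the page.
You can hide ads completely by upgrading to one of our paid plans.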


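The main problem is removing the unnecessary advertisements and banners. I made a simple function, scrape_data(): you pass it the HTML string and it returns the scraped content: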

data_1 = """
<div class="entry-content">
        <p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
        <div id="wordads-preview-parent" class="wpcnt">
            <div class="wpa">
                <span class="wpa-about">Advertisements</span>
                <div class="u">
                    <div class="wpa-notice">
                        <p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
                        <p class="wpa-buttons">
                            <a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
                            <a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
                        </p>
                    </div>
                </div>
            </div>
        </div>"""

data_2 = """
<div class="entry-content">
            <h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>        </div><!-- .entry-content -->
    </div><!-- .entry-body -->"""

from bs4 import BeautifulSoup

def scrape_data(data):
    soup = BeautifulSoup(data, 'lxml')
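    # Empty the WordAds advertisement block so its text is not returned
    for div in soup.select('div#wordads-preview-parent'):
        div.clear()
    # Empty the share/like widget block ("Share this", "Like this")
    for div in soup.select('div#jp-post-flair'):
        div.clear()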
    return soup.select_one('.entry-content').text.strip()

print(scrape_data(data_1))
print('-' * 80)
print(scrape_data(data_2))
print('-' * 80)

Prints:

We are under the same sky.
You and I.
I share the soul of earth with you,
to contribute a verse too.
I have words to give,
a smile to offer.
You are at your right place.
You live ,you stay ,you move ,you play.
May also have works to do and words to say.
We may cross each other or not.
But the thing is, we are here,
in this instant;So what, not so clear.
But the powerful play goes on,
for you may contribute a verse.
--------------------------------------------------------------------------------
There are lessons which aren’t taught
Everything black isn’t always dark
Everything you love isn’t always desired
Everything you need isn’t always desired
Everything you look isn’t always watched
And everything you do isn’t always what u did.
REMEMBER!!!!!
--------------------------------------------------------------------------------
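
Because the question asks for one function that works across different blogs, the same idea can be made configurable. The following is only a sketch under stated assumptions, not part of the original answer: the names scrape_post, DEFAULT_NOISE_SELECTORS, content_selector and noise_selectors are hypothetical, it uses the built-in 'html.parser' instead of 'lxml' so no extra parser needs to be installed, and it calls decompose() (which removes the tag entirely) where the answer above uses clear() (which only empties it). Other blog themes will probably need different selectors.

```python
from bs4 import BeautifulSoup

# Selectors for blocks that are not part of the post body. Both come from the
# answer above; extending this tuple for other themes is an assumption.
DEFAULT_NOISE_SELECTORS = (
    "div#wordads-preview-parent",  # WordAds advertisement preview
    "div#jp-post-flair",           # share/like widgets
)


def scrape_post(html, content_selector=".entry-content",
                noise_selectors=DEFAULT_NOISE_SELECTORS):
    """Return the post text, or None if no content container is found."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(content_selector)
    if content is None:
        return None
    for selector in noise_selectors:
        for tag in content.select(selector):
            tag.decompose()  # remove the tag and everything inside it
    # Strip each line and drop the empty ones left behind by removed blocks.
    lines = (line.strip() for line in content.get_text().splitlines())
    return "\n".join(line for line in lines if line)


sample = """
<div class="entry-content">
<p>We are under the same sky.</p>
<p>You and I.</p>
<div id="wordads-preview-parent" class="wpcnt"><span>Advertisements</span></div>
</div>
"""

print(scrape_post(sample))
```

Returning None when the container is missing lets the caller tell "no post found" apart from "post with empty text", which is useful when the same function is run over pages from blogs with unknown layouts.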