Python: get only the top-level text inside an HTML tag

python selenium web-scraping


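First off, I'm using Python, Selenium, and a bit of BeautifulSoup for web scraping. Maybe they don't work well together, but I haven't been able to solve this so far. I don't think it's beyond the wit of man, but it is beyond my efforts.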

Here is the HTML:

<div class="summary">
            <div class="headingDate">09 January 2020 18:45 </div>
            <div class="callout"><span class="grey">Bob Smith</span>Student of the Week - JANUARY </div>


        </div>

        <div class="body">
            January 2020

                <div class="boxContent">                    


<div class="third-small">
    <div class="dropzone drop-smaller dz-clickable" id="d-3d3361e5-1e47-403c-a6b5-10137143f994">
        <div class="dz-message" data-dz-message="">
            <p class="centre"><i class="far fa-image biggest"></i></p>
            <p class="centre">Drag and drop file here to attach</p>
            <span class="bigLink"><i class="fa fa-upload"></i> Or choose file</span>
        </div>



The actual HTML goes on further. Basically, the "body" div is quite large and contains this "third-small" block and other similar items.

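My problem seems simple: I just want to get "January 2020" from the "body" div, but I haven't managed it. If I use BeautifulSoup's get_text, it also returns all the other contained text (such as "Drag and drop file here to attach"), with no obvious way to separate it out. Yes, there are some newlines, but there are also newlines in the text above, so splitting on them doesn't feel like a safe approach. I also tried BeautifulSoup's find_all, but that only returns the contained tags, which doesn't include the text.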
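To make the behaviour described above concrete, here is a minimal sketch. The `html` string is a cut-down, hypothetical stand-in for the real page, and the variable names are illustrative only. It shows that `get_text()` flattens the text of every descendant, while `find_all()` with no arguments returns only tag objects.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical stand-in for the real page
html = """
<div class="body">
    January 2020
    <div class="boxContent">
        <p class="centre">Drag and drop file here to attach</p>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="body")

# get_text() concatenates the text of all descendants, nested or not
all_text = body.get_text()

# find_all() with no arguments yields only Tag objects, never bare text nodes
tag_names = [tag.name for tag in body.find_all()]

print(repr(all_text))
print(tag_names)
```

The wanted text and the unwanted nested text both end up in `all_text`, which is why splitting that string afterwards is fragile.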

Is there a way? I've also tried using Selenium methods, but without luck.

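from bs4 import BeautifulSoup

# Trimmed version of the HTML from the question
html = """
<div class="summary">
    <div class="headingDate">09 January 2020 18:45 </div>
    <div class="callout"><span class="grey">Bob Smith</span>Student of the Week - JANUARY </div>
</div>
<div class="body">
    January 2020
    <div class="boxContent">
        <div class="third-small">
            <p class="centre">Drag and drop file here to attach</p>
            <span class="bigLink"><i class="fa fa-upload"></i> Or choose file</span>
        </div>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# contents[0] is the first child node of the div: the text before any nested tag
print(soup.find("div", class_="body").contents[0].strip())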

Output:

January 2020
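
Note that `contents[0]` works here because the wanted text happens to be the first child of the div. If the text could come after a nested tag, or be split across several text nodes, one possible generalisation is to keep every direct child that is a text node. This is only a sketch: the helper name `top_level_text` is hypothetical, and the HTML is again a cut-down stand-in with an extra "trailing note" added for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def top_level_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    parts = []
    for child in tag.children:
        # NavigableString covers bare text; Comment is a subclass, so exclude it
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:
                parts.append(text)
    return " ".join(parts)


# Hypothetical markup: top-level text appears before and after a nested div
html = """
<div class="body">
    January 2020
    <div class="boxContent"><p>Drag and drop file here to attach</p></div>
    trailing note
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(top_level_text(soup.find("div", class_="body")))
```

When the page is loaded through Selenium, the same helper could presumably be applied to a soup built from the driver's `page_source` string, assuming a live driver object is available.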