Scraping between multiple HTML tags with the same name

html python-3.x web-scraping beautifulsoup

I want to extract the HTML that sits between two HTML tags with the same id.

html = '''<div id="note">

    <div id="seccion">
        <a name="title">Title of the seccion 1</a>
    </div>

    <div id="content">
        <div id="col1">xxx</div>
        <div id="col2">xxx</div>
    </div>

    <div id="content">
        <div id="col1">xxx</div>
        <div id="col2">xxx</div>
    </div>

    <div id="seccion">
        <a name="title">Title of the seccion 2</a>
    </div>

    <div id="block">
        <div id="col1">xxx</div>
        <div id="col2">xxx</div>
    </div>

    <div id="block">
        <div id="col1">xxx</div>
        <div id="col2">xxx</div>
    </div>

    <div id="seccion">
        <a name="title">Title of the seccion 3</a>
    </div>

    <div id="block">
        <div id="col1">xxx</div>
        <div id="col2">xxx</div>
    </div>

</div>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

seccion= soup.find_all("div", {"id": "seccion"})
for item in seccion:
    print([a.text for a in item.find_all("a", {"name": "title"})])

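Unfortunately, the sections are not wrapped in a div of their own that contains their sub-blocks, and I don't know in advance how many blocks each section has.

I'm not sure whether it is possible to extract the HTML between two divs when they share the same name.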

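You can use find_all() with the parameter recursive=False to separate the sections. With recursive=False only the direct children of the outer <div> are returned, so the nested col1/col2 divs are not visited on their own. Then check whether each <div> has the id="seccion" attribute: if it does, it starts a new section; otherwise it belongs to the most recent one.

For example: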

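from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html is the string from the question

sections = []
# recursive=False: iterate only over the direct children of <div id="note">
for div in soup.select_one('div#note').find_all('div', recursive=False):
    if div.get('id') == 'seccion':
        # a section header starts a new group
        sections.append([div])
    elif sections:
        # any other div belongs to the most recent section
        # (divs that appear before the first header are skipped)
        sections[-1].append(div)

for section in sections:
    for div in section:
        print(div.get_text(strip=True, separator='\n'))
    print('-' * 80)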

This prints the three sections separately:

Title of the seccion 1
xxx
xxx
xxx
xxx
--------------------------------------------------------------------------------
Title of the seccion 2
xxx
xxx
xxx
xxx
--------------------------------------------------------------------------------
Title of the seccion 3
xxx
xxx
--------------------------------------------------------------------------------
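
Since the question asks for the HTML rather than the text, the same grouping idea can be adapted to keep each block's markup. The following is only a minimal sketch: sample_html is a shortened, made-up copy of the question's markup, and html_by_section is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened sample with the same structure as the question's markup.
sample_html = '''<div id="note">
    <div id="seccion"><a name="title">Title of the seccion 1</a></div>
    <div id="content"><div id="col1">xxx</div></div>
    <div id="content"><div id="col1">yyy</div></div>
    <div id="seccion"><a name="title">Title of the seccion 2</a></div>
    <div id="block"><div id="col1">zzz</div></div>
</div>'''


def html_by_section(markup):
    """Map each section title to the HTML of the divs that follow it."""
    soup = BeautifulSoup(markup, 'html.parser')
    grouped = {}
    current = None
    # Only direct children of the outer div are visited.
    for div in soup.select_one('div#note').find_all('div', recursive=False):
        if div.get('id') == 'seccion':
            # A header div starts a new entry keyed by its title text.
            current = div.get_text(strip=True)
            grouped[current] = []
        elif current is not None:
            # str(tag) serialises the tag together with its children.
            grouped[current].append(str(div))
    return grouped


result = html_by_section(sample_html)
for title, blocks in result.items():
    print(title, len(blocks))
```

Keying the dictionary by title assumes the titles are unique; if they can repeat, a list of (title, blocks) pairs would be the safer structure.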

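Another option is to use Selenium.

Download the Google Chrome driver (ChromeDriver).

To get the XPath, right-click the element in the browser's developer tools, click "Copy", and choose "Copy XPath" or "Copy full XPath".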

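from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path='Path_to_chromedriver.exe'), options=options)
driver.get('url')  # webpage URL
text = driver.find_element(By.XPATH, 'Element_xpath').text  # get the element's text
driver.quit()  # close Chrome and end the driver session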
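
To see what a copied XPath actually selects without launching a browser, a small path expression can be tried on static markup. The sketch below swaps in Python's standard xml.etree.ElementTree instead of Selenium; it supports only a limited subset of XPath and only parses well-formed markup, so it illustrates the path syntax and is not a replacement for the Selenium call. The markup string is a made-up fragment modelled on the question.

```python
import xml.etree.ElementTree as ET

# Assumption: a small, well-formed fragment modelled on the question's markup.
markup = '''<div id="note">
    <div id="seccion"><a name="title">Title of the seccion 1</a></div>
    <div id="content"><div id="col1">xxx</div></div>
    <div id="seccion"><a name="title">Title of the seccion 2</a></div>
    <div id="block"><div id="col1">yyy</div></div>
</div>'''

root = ET.fromstring(markup)

# Attribute-based path: every <a name="title"> inside a <div id="seccion">.
titles = [a.text for a in root.findall("./div[@id='seccion']/a[@name='title']")]

# Positional path, similar in spirit to a "full XPath":
# the second child <div> of the root, then its first child <div>.
first_col = root.find("./div[2]/div[1]").text

print(titles)
print(first_col)
```

Positional paths like the second one break as soon as the page layout changes, which is why an attribute-based path is usually the more robust choice when one is available.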