How to get paragraphs from HTML using Python

How can I extract paragraphs from poorly structured HTML?

I have the following raw HTML text:

This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>
Parsing it returns a new HTML text that contains 2 paragraphs:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        <br/>
    </p>
    <ul>
        <li>AA Early Childhood Education, or related field. </li>
        <li>2+ years experience in a licensed childcare facility </li>
        <li>Ability to meet state requirements, including finger print clearance. </li>
        <li>Excellent oral and written communication skills </li>
        <li>Strong organization and time management skills. </li>
        <li>Creativity in expanding children's learning through play.
            <br/> </li>
        <li>Strong classroom management skills.
            <br/> </li>
    </ul>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
        <br/> </p>
</body>

</html>

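This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:

  • AA Early Childhood Education, or related field.
  • 2+ years experience in a licensed childcare facility
  • Ability to meet state requirements, including finger print clearance.
  • Excellent oral and written communication skills
  • Strong organization and time management skills.
  • Creativity in expanding children's learning through play.
  • Strong classroom management skills.
The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.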

But this is not what I expected. Instead, I would like to get the following HTML text:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        AA Early Childhood Education, or related field.
        2+ years experience in a licensed childcare facility
        Ability to meet state requirements, including finger print clearance.
        Excellent oral and written communication skills
        Strong organization and time management skills.
        Creativity in expanding children's learning through play.
        Strong classroom management skills.
    </p>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>

</html>

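This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.

The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.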

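To get the HTML above, I think the best approach is to remove every HTML tag except <p> from the raw HTML.

To do that, I tried the following regular expression: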

new_html = re.sub('<[^<]+?>', '', html)
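
As a quick check of that attempt: the pattern matches any tag regardless of its name, so <p> and </p> are stripped along with everything else. A minimal sketch, using a shortened, made-up sample string rather than the full job description:

```python
import re

# Shortened, hypothetical sample in the same shape as the raw HTML above.
html = "Intro:<br><ul><li>One</li><li>Two</li></ul><p>Closing paragraph.<br></p>"

# '<[^<]+?>' matches every tag, whatever its name, so no <p> survives.
new_html = re.sub('<[^<]+?>', '', html)
print(new_html)  # Intro:OneTwoClosing paragraph.
```

Only the text content is left, and the paragraph boundaries are lost.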

This is a manual document manipulation, but you can loop over the li elements and append them to the first paragraph. Then remove the ul element:

from bs4 import BeautifulSoup


data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
    <br>
</p>"""

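# With the lxml parser, the leading bare text ends up in its own <p>
# (as in the parsed output shown in the question), so soup.p is the first paragraph.
soup = BeautifulSoup(data, "lxml")

p = soup.p
for li in soup.find_all("li"):
    # Move each list item's text into the first paragraph, then drop the <li>.
    p.append(li.get_text())
    li.extract()

# Remove the now-empty <ul>.
soup.find("ul").extract()
print(soup.prettify())
Prints: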

<html>
 <body>
  <p>
   This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
   AA Early Childhood Education, or related field.
   2+ years experience in a licensed childcare facility
   Ability to meet state requirements, including finger print clearance.
   Excellent oral and written communication skills
   Strong organization and time management skills.
   Creativity in expanding children's learning through play.
   Strong classroom management skills.
  </p>
  <p>
   The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
   <br/>
  </p>
 </body>
</html>


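This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.

The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.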
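
If the goal is a Python list of paragraphs rather than rewritten HTML, the same idea (list items join the paragraph that precedes them) can also be sketched with the standard library alone. This swaps BeautifulSoup for html.parser, so it is an alternative to the approach above, not the same code. The names ParagraphCollector and extract_paragraphs, and the shortened sample, are made up for illustration:

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect text chunks into paragraphs; a new paragraph starts at each <p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = [[]]  # one list of text chunks per paragraph

    def handle_starttag(self, tag, attrs):
        # Only <p> opens a new paragraph. <ul>, <li> and <br> are ignored,
        # so list items stay in the paragraph that precedes them.
        if tag == "p" and self.paragraphs[-1]:
            self.paragraphs.append([])

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.paragraphs[-1].append(text)


def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return ["\n".join(chunks) for chunks in collector.paragraphs if chunks]


# Shortened, hypothetical sample in the same shape as the raw HTML above.
sample = """Intro text:
<br>
<ul>
    <li>First item.  </li>
    <li>Second item.<br>  </li>
</ul>
<p>Closing paragraph.
    <br>
</p>"""

for paragraph in extract_paragraphs(sample):
    print(repr(paragraph))
```

This sketch treats every <p> start tag as a paragraph break and ignores all other tags, which may be too coarse for HTML that nests paragraphs inside other block elements.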


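Short answer
new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

Long answer
Your original regex looks odd. I would match a tag with <[^>]+> rather than <[^<]+?>. To keep <p> and </p>, the text between < and > must then be one of:
[^p] (a single character other than p)
[^>/][^>]+ (two or more characters, not starting with /)
/[^p] (a slash followed by a single character other than p)
/[^>][^>]+ (a slash followed by two or more characters)

That is what the regex above expresses.

Here is a quick test typed into the Python console: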

>>> import re
>>> re.sub(
...     '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>',
...     '',
...     'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff')
'aa <p> bb  cc  dd  ee  ff </p> gg  hh  ii  jj  ff'
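
One caveat with the pattern above: it only spares the bare tags <p> and </p>. A paragraph tag with attributes, such as <p class="x">, has a body of two or more characters and is stripped as well. As a possible alternative (not the regex from this answer), a negative lookahead can express "any tag whose name is not p" more directly. A minimal sketch; the names NON_P_TAG and strip_tags_except_p are made up for illustration:

```python
import re

# Match any tag unless its name is exactly "p" (opening or closing).
# (?!/?p\b) rejects "<p>", "</p>" and "<p class=...>", but not "<pp>" or "<pre>".
NON_P_TAG = re.compile(r'<(?!/?p\b)[^>]+>', re.IGNORECASE)


def strip_tags_except_p(html):
    """Remove every tag except <p ...> and </p>; text content is left untouched."""
    return NON_P_TAG.sub('', html)


sample = 'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff'
print(strip_tags_except_p(sample))
# aa <p> bb  cc  dd  ee  ff </p> gg  hh  ii  jj  ff

print(strip_tags_except_p('<div><p class="x">kept</p><br/></div>'))
# <p class="x">kept</p>
```

Like any regex over HTML, this assumes no ">" characters inside attribute values or comments; a parser is the safer choice for arbitrary input.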
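Do you want to retrieve the text? If so, soup.get_text() should do it.
No, I want to retrieve a list of paragraphs.
Then what about all those li tags? Should they simply be replaced with their text?
Yes, and added to the first paragraph.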