I need to extract some data from an HTML page using Python
python, python-3.x, beautifulsoup
This is part of the HTML page, and I need to extract the following items from it: the name from the strong tag, the classification types (Actor and Singer), and the places of birth and death.
<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>
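Although I don't get any errors from my script, I only extract the name and the first classification. How do I locate the remaining elements I need: the second classification ("Singer") and the places of birth and death?

You can use the Beautiful Soup HTML parser. I'll show Beautiful Soup first and then a regex that captures the results with capture groups. If your HTML has the same form as shown here, use this: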
import re
string_1="""<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>"""
# Each alternative captures one field; '.' does not cross newlines, so \s consumes the line break.
pattern = r'<strong>(\w.+)<\/strong>|<b>Classification:<\/b>(\s.+)(\s.+)|(Born:.+)|(Died:.+\s.+\s.+\s.+)'
# Pulls the value of a name="..." attribute out of a captured fragment.
pattern_2 = r'name=["](\w.+?)["]'
for find in re.finditer(pattern, string_1, re.M):
    if find.group(1):
        print("Name {}".format(find.group(1)))
    if find.group(2):
        # Groups 2 and 3 hold the two lines that follow the Classification label.
        print("Classificiation first {}".format(re.search(pattern_2, find.group(2)).group(1)))
        print("Classification second {}".format(re.search(pattern_2, find.group(3)).group(1)))
    if find.group(4):
        print("Born {}".format(re.search(pattern_2, find.group(4)).group(1)))
    if find.group(5):
        print("Dead {}".format(re.search(pattern_2, find.group(5)).group(1)))
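Are you getting an error? If so, please edit your question and add the error. If the problem is something else, describe it: what does your program do when you run it?
I'm not getting an error. I only extract the name and the first classification. How do I locate the remaining elements I need: the classification ("Singer") and the places of birth and death?
Why don't you use find_all(class_='underline') and then pick items 0, 1, 2, 3?
Can you show me how to do that?
@florin See my updated answer; I've updated the Beautiful Soup part as you asked. If this solves your problem, you can accept the answer.
In general, parsing HTML (among other things) with regular expressions is not recommended. The OP (all of us, really) would be better off with a proper parser.
Thank you very much, Nick! Can you tell me how to locate the remaining 3 elements?
@Nick I've added both approaches, the BS HTML parser and the regex.
@Ayodhyankitpul But if I have more li tags containing the same elements, how do I search across all of them?
@florin First find_all('li').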
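The two suggestions in the comments (select the links by their underline class, and call find_all on the li tags first) can be combined roughly as follows. This is only a sketch under assumptions: the markup is a trimmed copy of the question's snippet, the names `html` and `records` are made up for illustration, and it assumes every li has a strong tag and lists its links in the same order (classifications first, then birthplace, then place of death).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative only).
html = """<li class="clearfix">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
</li>"""

soup = BeautifulSoup(html, "html.parser")

records = []
# One iteration per person, so the same logic covers a page with many <li> items.
for li in soup.find_all("li", class_="clearfix"):
    name = li.find("strong").get_text(strip=True)
    # Only the classification and place links carry class="underline".
    links = [a.get_text(strip=True) for a in li.find_all("a", class_="underline")]
    records.append({
        "name": name,
        # Assumption: the last two links are birthplace and place of death.
        "classifications": links[:-2],
        "born": links[-2],
        "died": links[-1],
    })

print(records)
```

Positional indexing like this breaks if an entry lacks a birthplace or a place of death, so the slice boundaries are an assumption about this particular page rather than something the markup guarantees.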
string_1="""<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(string_1, "html.parser")
# Every <a> in this snippet carries a name attribute, so a['name'] is safe here.
for a in soup.find_all('a'):
    print(a['name'])
Elvis Presley
Mr. Elvis Presley
Actor
Singer
Tupelo
Memphis
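The loop above prints every name attribute but does not say which value belongs to which field. One possible way to keep that association is to start from each bold label and collect the links that follow it until the next label. The sketch below is an illustration, not the answerer's code: the helper `fields_by_label` and the variable names are hypothetical, the markup is a trimmed copy of the question's snippet, and it assumes each field is introduced by a `<b>` label inside the `<p>`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the markup from the question (illustrative only).
html = """<li class="clearfix">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
</li>"""


def fields_by_label(p):
    """Map each <b> label (e.g. 'Born') to the link texts that follow it."""
    fields = {}
    for b in p.find_all("b"):
        label = b.get_text(strip=True).rstrip(":")
        values = []
        for sibling in b.next_siblings:
            # Plain text and HTML comments are not Tag objects, so they are skipped.
            if not isinstance(sibling, Tag):
                continue
            if sibling.name == "b":
                break  # the next label starts a new field
            if sibling.name == "a":
                values.append(sibling.get_text(strip=True))
        fields[label] = values
    return fields


soup = BeautifulSoup(html, "html.parser")
results = []
for li in soup.find_all("li"):
    entry = {"Name": li.find("strong").get_text(strip=True)}
    entry.update(fields_by_label(li.find("p")))
    results.append(entry)

print(results)
```

Because the `<b>City:</b>` label sits inside an HTML comment, the parser never turns it into a tag, so it does not show up as a field.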
Output of the regex script (the re.finditer version shown earlier):
Name Mr. Elvis Presley
Classificiation first Actor
Classification second Singer
Born Tupelo
Dead Memphis
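The regex script calls re.search(...).group(1) directly, which raises AttributeError whenever a captured fragment has no name attribute. If the regex route is kept despite the caveat in the comments, a small guard avoids that crash. The helper name `first_name_attr` is hypothetical, and the pattern is a simplified stand-in for pattern_2, not the original.

```python
import re

# Simplified stand-in for pattern_2: the value of the first name="..." attribute.
NAME_ATTR = re.compile(r'name="([^"]+)"')


def first_name_attr(fragment):
    """Return the first name attribute in an HTML fragment, or None if absent."""
    match = NAME_ATTR.search(fragment or "")
    return match.group(1).strip() if match else None


print(first_name_attr('<b>Born:</b> <a href="http://" name="Tupelo" class="underline">'))
print(first_name_attr('<b>Born:</b> unknown'))
```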