Python: extract the information in i and br tags and save it in a dictionary
Tags: python, dictionary, beautifulsoup

I have an HTML page from which I need to extract the information held in the <i> and <br/> tags and save it in a dictionary. The markup looks like this:
<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>
Why not use a regular expression? You don't need to parse the actual HTML (unless you also need positional information):
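import re

data = """<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>
"""

# Each match is a (key, value) tuple, so the list feeds straight into dict().
parsed = dict(re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# Output (key order may vary with the Python version):
# {'objectid': '137000', 'topoid': '504514394', 'poigroup': 'Hydrography', 'poitype': 'Manmade Waterbody', 'poiname': 'FOUR CORNERS DAM', 'poilabel': 'FOUR CORNERS DAM', 'poilabeltype': 'NAMED', 'poialtlabel': '', 'Point': '', 'X': '1.5778346701624997E7', 'Y': '-3861557.6243750006'}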
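If you want X and Y converted to floats and so on, some additional post-processing is needed. For a general solution, you could try converting each value to the most specific type that accepts it: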
def conv(pair):
    # Missing or empty value: keep the key and store None.
    if len(pair) < 2 or not pair[1]:
        return pair[0], None
    try:
        return pair[0], int(pair[1])
    except ValueError:
        try:
            return pair[0], float(pair[1])
        except ValueError:
            # Neither int nor float: leave the value as a string.
            return pair

parsed = dict(conv(element) for element in re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# {'X': 15778346.701624997, 'Y': -3861557.6243750006, 'objectid': 137000, 'poilabeltype': 'NAMED', 'poialtlabel': None, 'poiname': 'FOUR CORNERS DAM', 'poitype': 'Manmade Waterbody', 'Point': None, 'poilabel': 'FOUR CORNERS DAM', 'topoid': 504514394, 'poigroup': 'Hydrography'}
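
To see what each part of that pattern captures, here is a minimal sketch that rewrites the same expression with re.VERBOSE and named groups. The group names key and value, the variable names, and the shortened sample are chosen here for illustration only; they are not part of the original answer.

```python
import re

# Shortened fragment in the same shape as the question's HTML.
sample = """<i>objectid: </i> 137000<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>"""

# Same pattern as in the answer, spread out with comments.
# "key" and "value" are illustrative group names.
PAIR = re.compile(r"""
    <i>\s*            # opening tag, optional whitespace
    (?P<key>.*?)      # key: shortest text up to the first colon
    :.*?</i>          # the colon, anything after it, then the closing tag
    \s*               # whitespace between the closing tag and the value
    (?P<value>.*?)    # value: shortest text before the line-break tag
    \s*<br/>          # optional trailing whitespace, then the br tag
""", re.VERBOSE)

pairs = [(m.group("key"), m.group("value")) for m in PAIR.finditer(sample)]
print(pairs)
# [('objectid', '137000'), ('poialtlabel', ''), ('Point', ''), ('X', '1.5778346701624997E7')]
```

Labels without a value, such as poialtlabel and Point, still match and simply yield an empty string for the second group.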
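How it works: it is quite simple. Between the <i> and <br/> tags it searches for two captured groups: one immediately after <i>, allowing whitespace, and the other immediately after </i>, again allowing whitespace. All matches are collected and looped over, with the first captured group used as the key and the second as the value of the new dict.

Check the following approach: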
from bs4 import BeautifulSoup as Soup, NavigableString

html = """<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>"""

soup = Soup(html, 'html.parser')
obj = dict()
for i in soup.find_all('i'):
    # The label text, without the trailing colon and spaces, becomes the key.
    key = str(i.get_text()).strip(' :')
    value = i.next_sibling
    # Only text nodes are values; Point is followed directly by <br/>, so it is skipped.
    if isinstance(value, NavigableString):
        obj[key] = str(value).strip()
print(obj)
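
If labels without a value, such as Point, should still appear in the result, the loop can be adapted. The following is only a sketch under that assumption: the function name fields_from_html and the shortened sample string are hypothetical, and storing None for missing values is a design choice made here, not something the original answer does.

```python
from bs4 import BeautifulSoup, NavigableString


def fields_from_html(html):
    """Map each <i> label to the text right after it, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("i"):
        key = tag.get_text().strip(" :")
        sibling = tag.next_sibling
        # Only a text node counts as a value; a <br/> tag directly after
        # the label (as with "Point:") means the label has no value.
        text = sibling.strip() if isinstance(sibling, NavigableString) else ""
        fields[key] = text or None
    return fields


# Shortened, hypothetical sample in the same shape as the question's markup.
sample = (
    '<div class="rbody">'
    "<i>objectid: </i> 137000<br/>"
    "<i>poialtlabel: </i> <br/>"
    "<i>Point:</i><br/>"
    "<i>Y: </i> -3861557.6243750006 <br/>"
    "</div>"
)
result = fields_from_html(sample)
print(result)
```

With this variant, both poialtlabel (whitespace only) and Point (no text node at all) end up as None instead of being dropped or stored as empty strings.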
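You could first remove the br tags with decompose(), then retrieve the i tags with the select() method and take the text that follows each tag: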
In [81]: from bs4 import BeautifulSoup as BS
In [82]: html = """<div class="rbody">
...: <div style="color:#ff6666"> </div>
...: <i>objectid: </i> 137000<br/>
...: <i>topoid: </i> 504514394<br/>
...: <i>poigroup: </i> Hydrography<br/>
...: <i>poitype: </i> Manmade Waterbody<br/>
...: <i>poiname: </i> FOUR CORNERS DAM<br/>
...: <i>poilabel: </i> FOUR CORNERS DAM<br/>
...: <i>poilabeltype: </i> NAMED<br/>
...: <i>poialtlabel: </i> <br/>
...: <i>Point:</i><br/>
...: <i>X: </i> 1.5778346701624997E7 <br/>
...: <i>Y: </i> -3861557.6243750006 <br/>
...: <br/><br/>
...: </div>"""
In [83]: soup = BS(html, "html.parser")
In [84]: for br in soup.select(".rbody > br"):
...:     br.decompose()
...:
In [85]: {i.get_text(strip=True).replace(":", ""): i.next_sibling.strip() for i in soup.select(".rbody > i")}
Out[85]:
{'Point': '',
'X': '1.5778346701624997E7',
'Y': '-3861557.6243750006',
'objectid': '137000',
'poialtlabel': '',
'poigroup': 'Hydrography',
'poilabel': 'FOUR CORNERS DAM',
'poilabeltype': 'NAMED',
'poiname': 'FOUR CORNERS DAM',
'poitype': 'Manmade Waterbody',
'topoid': '504514394'}
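Comments:

Thanks for the solution. I did try it, but it gives me a TypeError: expected string or buffer.

I imagine you are passing the BS object instead of the data, as in the example above. Make sure you pass string data to it (for example, call get_text() on it first, if applicable).

Thanks for the solution. I tried it, but it gave me an empty dictionary as output. I am actually using 2.7.x, but I think the version should not make any difference.

The captured keys should be stripped of the trailing colon and whitespace to match the dictionary the OP wants.

I edited the answer so that it works with Python 2 and strips the whitespace and the colon.

Charles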
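
The TypeError mentioned in the comments can be reproduced by handing a parsed BeautifulSoup object to re.findall. The sketch below assumes that is what happened and shows one possible fix, serialising the soup back to markup with str(); the shortened sample and the variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the question's markup.
html = '<div class="rbody"><i>objectid: </i> 137000<br/><i>X: </i> 1.5E7 <br/></div>'
soup = BeautifulSoup(html, "html.parser")
pattern = r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>"

# re functions only accept str or bytes-like input, so a soup object fails.
error = None
try:
    re.findall(pattern, soup)
except TypeError as exc:
    error = exc

# str(soup) gives back the markup, which the pattern can search.
parsed = dict(re.findall(pattern, str(soup)))
print(parsed)
```

Note that the pattern relies on the tags being present in the string, so it has to run on markup rather than on tag-free text.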
For reference, the find_all / next_sibling loop shown earlier prints the following (Point is absent because it has no text value):

{
    'poilabeltype': 'NAMED',
    'objectid': '137000',
    'poilabel': 'FOUR CORNERS DAM',
    'poialtlabel': '',
    'poigroup': 'Hydrography',
    'Y': '-3861557.6243750006',
    'X': '1.5778346701624997E7',
    'poiname': 'FOUR CORNERS DAM',
    'poitype': 'Manmade Waterbody',
    'topoid': '504514394'
}