Python: extract the information in i and br tags and save it in a dictionary


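I have an HTML page from which I need to extract the information held in the i and br tags and save it in a dictionary. The markup looks like this: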

<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>

Why not use a regular expression? You don't need to parse the actual HTML (unless you also need positional information):

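import re

data = """<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>"""

parsed = dict(re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# {'objectid': '137000', 'topoid': '504514394', 'poigroup': 'Hydrography', 'poitype': 'Manmade Waterbody', 'poiname': 'FOUR CORNERS DAM', 'poilabel': 'FOUR CORNERS DAM', 'poilabeltype': 'NAMED', 'poialtlabel': '', 'Point': '', 'X': '1.5778346701624997E7', 'Y': '-3861557.6243750006'}
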
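If you want X and Y converted to floats and so on, some additional post-processing is needed. For a general solution, you can try converting each value to the most specific type that accepts it: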

def conv(pair):
    if len(pair) < 2 or not pair[1]:
        return pair[0], None
    try:
        return pair[0], int(pair[1])
    except ValueError:
        try:
            return pair[0], float(pair[1])
        except ValueError:
            return pair

parsed = dict(conv(element) for element in re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# {'X': 15778346.701624997, 'Y': -3861557.6243750006, 'objectid': 137000, 'poilabeltype': 'NAMED', 'poialtlabel': None, 'poiname': 'FOUR CORNERS DAM', 'poitype': 'Manmade Waterbody', 'Point': None, 'poilabel': 'FOUR CORNERS DAM', 'topoid': 504514394, 'poigroup': 'Hydrography'}
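
The pattern is easier to read when its two capture groups are named. The sketch below is an illustrative variant rather than part of the answer: the group names key and value, the shortened sample string, and the use of re.VERBOSE are all assumptions made for the demonstration.

```python
import re

# Shortened, hypothetical excerpt of the markup from the question.
sample = """<i>objectid: </i> 137000<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>"""

# Same pattern as above, split over several lines with named groups.
pattern = re.compile(r"""
    <i>\s*(?P<key>.*?):   # key: text after <i>, up to the first colon
    .*?</i>               # skip whatever remains before the closing </i>
    \s*(?P<value>.*?)\s*  # value: text after </i>, surrounding whitespace dropped
    <br/>                 # each entry ends at the next <br/>
""", re.VERBOSE)

pairs = {m.group("key"): m.group("value") for m in pattern.finditer(sample)}
print(pairs)
```

Entries with nothing between the closing i tag and the br tag, such as poialtlabel and Point, come out as empty strings, which is what the conv helper above turns into None.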

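How it works: it searches for two capture groups between the <i> and <br/> tags. The first comes immediately after <i> (leading whitespace allowed) and the second immediately after </i> (again allowing whitespace). All matches are collected, and each pair is then used to build the new dict, with the first captured group as the key and the second as the value.

Check out the following approach: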

from bs4 import BeautifulSoup as Soup, NavigableString

html = """<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>"""

soup = Soup(html, 'html.parser')

obj = dict()
for i in soup.find_all('i'):
    key = str(i.get_text()).strip(' :')
    value = i.next_sibling
    if isinstance(value, NavigableString):  # Skip entries such as Point, where <br/> follows directly and there is no text value.
        obj[key] = str(value).strip()
print(obj)
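You can first remove the br tags from the div with decompose(), then use the select() method to retrieve the i tags and take the text that follows each one.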

In [81]: from bs4 import BeautifulSoup as BS

In [82]: html = """<div class="rbody">
    ...: <div style="color:#ff6666"> </div>
    ...: <i>objectid: </i> 137000<br/>
    ...: <i>topoid: </i> 504514394<br/>
    ...: <i>poigroup: </i> Hydrography<br/>
    ...: <i>poitype: </i> Manmade Waterbody<br/>
    ...: <i>poiname: </i> FOUR CORNERS DAM<br/>
    ...: <i>poilabel: </i> FOUR CORNERS DAM<br/>
    ...: <i>poilabeltype: </i> NAMED<br/>
    ...: <i>poialtlabel: </i> <br/>
    ...: <i>Point:</i><br/>
    ...: <i>X: </i> 1.5778346701624997E7 <br/>
    ...: <i>Y: </i> -3861557.6243750006 <br/>
    ...: <br/><br/>
    ...: </div>"""

In [83]: soup = BS(html, "html.parser")

In [84]: for br in soup.select(".rbody > br"):
    ...:     br.decompose()
    ...:     

In [85]: {i.get_text(strip=True).replace(":", ""): i.next_sibling.strip() for i in soup.select(".rbody > i")}
Out[85]: 
{'Point': '',
 'X': '1.5778346701624997E7',
 'Y': '-3861557.6243750006',
 'objectid': '137000',
 'poialtlabel': '',
 'poigroup': 'Hydrography',
 'poilabel': 'FOUR CORNERS DAM',
 'poilabeltype': 'NAMED',
 'poiname': 'FOUR CORNERS DAM',
 'poitype': 'Manmade Waterbody',
 'topoid': '504514394'}
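Thanks for the solution. I did try it, but it gave me a TypeError: expected string or buffer.

I imagine you are passing a BS object rather than string data as in the example above. Make sure you pass string data to it (e.g. call get_text() on it first, if applicable).

Thanks for the solution. I tried it, but it gave me an empty dictionary as output. I am actually using 2.7.x, but I don't think the version should make any difference.

The captured keys should be stripped of the trailing colon and whitespace to match the dictionary the OP wants.

I edited the answer to use python2 and to strip the spaces and ':' characters.
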
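The two problems reported in these comments can be reproduced with a small sketch. It assumes the regex from the first answer is applied to a document already parsed with BeautifulSoup; the shortened html string and the variable names raised, from_text and from_markup are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical excerpt of the markup from the question.
html = """<div class="rbody">
<i>objectid: </i> 137000<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>Point:</i><br/>
</div>"""

pattern = r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>"
soup = BeautifulSoup(html, "html.parser")

# Passing the soup object itself fails: re only accepts str or bytes.
try:
    re.findall(pattern, soup)
    raised = False
except TypeError:
    raised = True

# get_text() returns only the text nodes, so the <i> and <br/> markers
# the pattern relies on are gone and nothing matches.
from_text = dict(re.findall(pattern, soup.get_text()))

# str(soup) serializes the tree back to markup, which the pattern can match.
from_markup = dict(re.findall(pattern, str(soup)))

print(raised, from_text, from_markup)
```

In this sketch only the str(soup) call yields the key/value pairs. The get_text() route gives an empty dictionary, which may be what the second comment ran into.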
For reference, the find_all('i') loop shown earlier prints:
{
  'poilabeltype': 'NAMED',
  'objectid': '137000',
  'poilabel': 'FOUR CORNERS DAM',
  'poialtlabel': '',
  'poigroup': 'Hydrography',
  'Y': '-3861557.6243750006',
  'X': '1.5778346701624997E7',
  'poiname': 'FOUR CORNERS DAM',
  'poitype': 'Manmade Waterbody',
  'topoid': '504514394'
}