Python BeautifulSoup: getting the content inside HTML tags
I am developing a translator that translates the text inside HTML tags, and I am using BeautifulSoup because it is one of the best HTML parsers in Python. Here is the text, which I load into the soup:
In [95]: chalet.html
Out[95]: '<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>\r\n\r\n<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'
In [96]: html = soup(chalet.html)
In [97]: print(chalet.html)
<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
The next step is to break it down into its contents so that I can parse them:
In [105]: html.contents
Out[105]:
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]
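The items in between are newlines, and I can ignore them with a try/except block, but getting the string seems to work for only some of the elements, not all of them: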
In [107]: contents[0]
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
In [108]: contents[0].string
Out[108]: '“Create a space I would be truly excited to stay in”.'
In [109]: contents[1]
Out[109]: '\n'
In [110]: contents[1].string
Out[110]: '\n'
In [111]: contents[2]
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
In [112]: contents[2].string
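If you know how to extract these parts without stripping the tags between them, then a replace will take care of the main string.
Use the .stripped_strings property to get clean, stripped text from the HTML: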
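from bs4 import BeautifulSoup
from pprint import pprint

# The same HTML as in the question
html = '''<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'''

soup = BeautifulSoup(html, 'html.parser')
# Collect every non-empty text node, with surrounding whitespace stripped
texts = [*soup.stripped_strings]
pprint(texts)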
Output:
['“Create a space I would be truly excited to stay in”.',
'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
'stream and an alpine woodland. The result was Chalet',
'Belle Chéry.',
'Belle Chéry is a chalet built without constraint. A destination, to be '
...
To get a single long string, do the following:
long_string = ' '.join(texts)
Output:
“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
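As for why `contents[2].string` printed nothing in the question: `.string` returns `None` whenever a tag has more than one child, and the second `<h4>` holds two `<strong>` elements separated by a space. The following is a minimal sketch of that behaviour on a shortened copy of that heading (the shortened HTML is an assumption made for brevity, not the asker's full input):

```python
from bs4 import BeautifulSoup

# Shortened version of the second <h4> from the question
html = "<h4><strong>The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>"
soup = BeautifulSoup(html, "html.parser")
h4 = soup.h4

# Three children: <strong>, the text node " ", and another <strong>
children = len(h4.contents)

# .string only works when there is a single child to descend into
single = h4.string

# get_text() concatenates every text node below the tag
joined = h4.get_text()

# stripped_strings yields each non-empty text node, whitespace stripped
pieces = list(h4.stripped_strings)

print(children, single)
print(joined)
print(pieces)
```

So `.string` is only reliable for tags such as the first `<h4>`, where each level has exactly one child; for anything else, `get_text()` or `.stripped_strings` is the safer choice.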
You can use a list comprehension and str.join to join the list of contents without the newlines and get the desired output:
# str(data) is needed because html.contents holds Tag objects, which str.join rejects
contents = ''.join([str(data) for data in html.contents if data != '\n'])
Now you can create the soup:
soup = BeautifulSoup(contents, 'lxml')
Replace lxml with your preferred parser.
How can I get the output as an HTML string? This output handles the tags nicely.
str(soup) will give you the HTML.
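Putting the pieces together, one possible way to translate the text while leaving the markup untouched is to replace each text node in place and then serialise the tree with `str(soup)`. This is only a sketch: `translate` is a hypothetical placeholder (it merely upper-cases the text) standing in for whatever translation call is actually used, and the two-paragraph input is a shortened assumption rather than the full HTML from the question.

```python
from bs4 import BeautifulSoup


def translate(text):
    # Hypothetical stand-in for a real translation call
    return text.upper()


html = (
    "<p>The chalet itself is spread over 680m<sup>2</sup>, "
    "with one side almost entirely glazed.</p>\n"
    "<p>A destination, to be experienced.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so nodes can be replaced safely while looping
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only nodes such as "\n"
        node.replace_with(translate(node))

# str(soup) serialises the modified tree back to HTML
result = str(soup)
print(result)
```

Note that each text node is translated on its own, so a sentence interrupted by an inline tag such as `<sup>` reaches the translator in fragments; whether that is acceptable depends on the translation service.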