Python BeautifulSoup: getting the content inside HTML tags

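I am working on a translator that translates the text inside HTML tags, and I am using BeautifulSoup because it is one of the best HTML parsers available for Python.

Here is the text, and how I load it into the soup: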

In [95]: chalet.html                                                                                                                                                                       
Out[95]: '<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>\r\n\r\n<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'

In [96]: html = soup(chalet.html)                                                                                                                                                          

In [97]: print(chalet.html)                                                                                                                                                                
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>

<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>

<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>

<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>

<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
The next step is to break it down into its contents so that I can parse them:

In [105]: html.contents                                                                                                                                                                    
Out[105]: 
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]
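The items between the elements are newlines, which I can skip with a try/except block, but getting .string only seems to work for some of the elements, not all of them: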

In [107]: contents[0]                                                                                                                                                                      
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>

In [108]: contents[0].string                                                                                                                                                               
Out[108]: '“Create a space I would be truly excited to stay in”.'

In [109]: contents[1]                                                                                                                                                                      
Out[109]: '\n'

In [110]: contents[1].string                                                                                                                                                               
Out[110]: '\n'

In [111]: contents[2]                                                                                                                                                                      
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>

In [112]: contents[2].string    
If you know how to extract these parts without stripping the tags between them, a replace on the main string will take care of the rest.

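Use the .stripped_strings property to get clean, stripped text from the HTML:

from bs4 import BeautifulSoup
from pprint import pprint

html = '''<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>

<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>

<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>

<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>

<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
'''

soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields every text node with surrounding whitespace removed,
# skipping nodes that are whitespace only
texts = [*soup.stripped_strings]
pprint(texts)

Output: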

['“Create a space I would be truly excited to stay in”.',
 'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
 'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
 'stream and an alpine woodland. The result was Chalet',
 'Belle Chéry.',
 'Belle Chéry is a chalet built without constraint. A destination, to be '
...
To get a single long string:

long_string = ' '.join(texts)
print(long_string)
Output:

“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
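
As a side note on why `contents[2].string` printed nothing in the question: `.string` is only set when a tag has a single child, and the second `<h4>` holds two `<strong>` tags with a space between them. The sketch below uses a shortened copy of that heading (the variable names are illustrative, not from the original post) and compares `.string` with `get_text()`:

```python
from bs4 import BeautifulSoup

# Shortened copy of the second <h4> from the question
snippet = "<h4><strong>The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>"
heading = BeautifulSoup(snippet, "html.parser").h4

# The <h4> has three children (<strong>, " ", <strong>), so .string is None
single_string = heading.string

# get_text() concatenates every descendant string instead
full_text = heading.get_text()

print(single_string)
print(full_text)
```

The first `<h4>` in the question works with `.string` because it contains exactly one `<strong>`, which in turn contains exactly one string.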

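You can use a comprehension and str.join to join the contents list without the newline entries and get the desired output. Each element has to be converted with str() first, because str.join only accepts strings and html.contents holds Tag objects:

contents = ''.join(str(data) for data in html.contents if data != '\n')
Now you can create the soup: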

soup = BeautifulSoup(contents, 'lxml')
Replace lxml with your preferred parser.
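
A self-contained variant of the same idea, as a minimal sketch: it assumes a tiny made-up document (`raw`) and filters by node type with `isinstance(node, Tag)` rather than by comparing against `'\n'`, which also drops other stray top-level text. It uses `html.parser` so that the rebuilt soup is not wrapped in `<html>`/`<body>`.

```python
from bs4 import BeautifulSoup, Tag

# Made-up stand-in for chalet.html
raw = "<h4><strong>Title</strong></h4>\n<p>Body text.</p>"
html = BeautifulSoup(raw, "html.parser")

# Keep only Tag children; str() turns each one back into markup
contents = "".join(str(node) for node in html.contents if isinstance(node, Tag))

# Re-parse the cleaned markup
soup = BeautifulSoup(contents, "html.parser")
print(soup.contents)
```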

How do I get the output as an HTML string? This output handles the tags nicely.

str(soup) will give you the HTML.
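
For the original goal of translating text while keeping the tags, one possible approach is to replace each text node in place and then serialize with `str(soup)`. This is only a sketch: `translate` is a hypothetical placeholder (it just upper-cases the text) standing in for whatever translation call you actually use, and the sample HTML is a shortened line from the question.

```python
from bs4 import BeautifulSoup

def translate(text):
    # Hypothetical stand-in for a real translation call
    return text.upper()

html = "<p>The chalet is spread over 680m<sup>2</sup>, with views.</p>"
soup = BeautifulSoup(html, "html.parser")

# find_all(string=True) returns a list of every text node,
# so the tree can safely be modified while looping over it
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only nodes such as '\n'
        node.replace_with(translate(node))

result = str(soup)
print(result)
```

Because only the text nodes are swapped, tags such as `<sup>` stay where they were in the markup.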