Python BeautifulSoup: get content without the next tag

python python-3.x jupyter-notebook

I have some HTML like this:

<p><span class="map-sub-title">abc</span>123</p>

I'm using BeautifulSoup, and here is my code:

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html, "lxml")
p = soup1.text

The result I get is 'abc123', but I want '123', not 'abc123'.

You can use the decompose() function to remove the span tag, and then get the text you want:

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")

for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()

print(soup.text)
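Note that decompose() modifies the parsed tree in place. If you still need the original markup afterwards, one option is to work on a copy of the tag. The following is only a sketch: the helper name text_without is made up for illustration, and it uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies.

```python
import copy

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")


def text_without(tag, name, class_name):
    """Hypothetical helper: text of `tag` with matching children removed.

    Works on a copy, so the original tree is left untouched.
    """
    clone = copy.copy(tag)  # detached copy of the tag and its children
    for unwanted in clone.find_all(name, class_=class_name):
        unwanted.decompose()
    return clone.get_text()


result = text_without(soup.p, "span", "map-sub-title")
print(result)             # text of the copy, span removed
print(soup.p.get_text())  # the original <p> still contains the span text
```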

If there is more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.strings)
['abc', '123']
>>> list(soup1.strings)[1]
'123'
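If the real markup is spread over several lines, .strings will also yield whitespace-only strings, which shifts the indexes. A small sketch of the same idea using .stripped_strings, which skips those and strips the rest; the multi-line HTML here is an assumed variant of the question's markup, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Assumed variant of the question's markup, with line breaks and indentation.
html = '<p>\n  <span class="map-sub-title">abc</span>\n  123\n</p>'
soup = BeautifulSoup(html, "html.parser")

# stripped_strings drops whitespace-only strings and strips the others.
parts = list(soup.p.stripped_strings)
print(parts)

# The wanted text is still the second string, as in the answer above.
wanted = parts[1]
print(wanted)
```

This still relies on knowing the position of the string, so it only fits markup whose structure you can count on.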

You can also use extract() to remove the unwanted tag before getting the text from the tag, like this:

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()

print(soup1.text)
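Unlike decompose(), extract() returns the tag it removed, so you can keep both pieces if you need them. A minimal sketch, using "html.parser" rather than "lxml" only so that it has no extra dependency:

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the span from the tree and hands it back.
removed = soup.p.span.extract()

removed_text = removed.get_text()   # text of the detached span
remaining_text = soup.p.get_text()  # what is left inside <p>

print(removed_text)
print(remaining_text)
```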

One of the many ways is to use contents on the parent tag (in this case, <p>).

If you know the position of the string, you can use it directly:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'

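With the second approach, filtering the contents for NavigableString instances, you get all the text that is a direct child of the <p> tag. For completeness, here is one more example: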
>>> html = '''
... <p>
...     I want
...     <span class="map-sub-title">abc</span>
...     foo
...     <span class="map-sub-title">abc2</span>
...     text
...     <span class="map-sub-title">abc3</span>
...     only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
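The same filtering can be wrapped in a small function. This is a sketch rather than the answer's own code: the name direct_text is hypothetical, it asks find_all() for strings with recursive=False instead of checking isinstance() on contents, and it uses "html.parser" to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

html = '''
<p>
    I want
    <span class="map-sub-title">abc</span>
    foo
    <span class="map-sub-title">abc2</span>
    text
    <span class="map-sub-title">abc3</span>
    only
</p>
'''
soup = BeautifulSoup(html, "html.parser")


def direct_text(tag, sep=" "):
    """Hypothetical helper: join the strings that are direct children of `tag`."""
    # string=True matches text nodes; recursive=False keeps to direct children.
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return sep.join(piece for piece in pieces if piece)


result = direct_text(soup.find("p"))
print(result)
```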

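Although every response in this thread seems acceptable, I'll point out another way to handle this case:

soup.find("span", {'class': 'map-sub-title'}).next_sibling

You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.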
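Here is a runnable sketch of the next_sibling approach on the question's HTML. The None and NavigableString checks are an addition for safety, since find() returns None when nothing matches and the next sibling could also be another tag; "html.parser" stands in for "lxml" so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", {"class": "map-sub-title"})

# find() returns None if there is no match, so guard before navigating.
sibling = span.next_sibling if span is not None else None

# Only accept the sibling if it is a text node rather than another tag.
text = str(sibling).strip() if isinstance(sibling, NavigableString) else ""
print(text)
```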