Python BeautifulSoup: get content without the nested tag
I have some HTML code like this:
<p><span class="map-sub-title">abc</span>123</p>
abc123
I used BeautifulSoup, and here is my code:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html, "lxml")
p = soup1.text
The result I get is 'abc123',
but the result I want is '123', not 'abc123'.
You can use the decompose() function to remove the span tag and then get the text you want:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()
print(soup.text)
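Because decompose() destroys the matching nodes in place, the parsed tree cannot be reused for other lookups afterwards. A minimal sketch of one way around that is to wrap the idea in a helper that parses a fresh tree on every call. The function name text_without and its parameters are hypothetical, and "html.parser" is used here instead of "lxml" only so the sketch runs without the lxml package.

```python
from bs4 import BeautifulSoup


def text_without(html, tag_name, class_name):
    """Return the text of `html` after dropping every matching tag.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # Parse a fresh tree on each call, because decompose() removes
    # the matched nodes from the tree permanently.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_name, class_=class_name):
        tag.decompose()
    return soup.get_text()


html = '<p><span class="map-sub-title">abc</span>123</p>'
result = text_without(html, "span", "map-sub-title")
print(result)
```

The same call also handles several matching spans inside one paragraph, since find_all returns all of them.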
If there is more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.strings)
['abc', '123']
>>> list(soup1.strings)[1]
'123'
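Indexing into the list of strings only works while the position of the wanted text is known. As a hedged alternative, find_all(string=True, recursive=False) on the p tag returns only the text nodes that are direct children of p, so the text inside the span is skipped regardless of position. This sketch uses "html.parser" rather than "lxml" only to avoid the extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# .strings walks the whole subtree, so it includes the span's text.
all_strings = [str(s) for s in soup.p.strings]

# recursive=False restricts the search to direct children of <p>;
# string=True keeps only text nodes, so the span itself is skipped.
direct_strings = [str(s) for s in soup.p.find_all(string=True, recursive=False)]

print(all_strings)
print(direct_strings)
```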
You can also use extract() to remove the unwanted tag before getting the text from the tag, as shown below:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
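One difference from decompose() that may matter here: extract() returns the element it removed, so the span's text is still available if it is needed as a label. A minimal sketch, using "html.parser" instead of "lxml" only to avoid the extra dependency, with illustrative variable names:

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the span from the tree and hands it back.
removed = soup.p.span.extract()

label = removed.get_text()   # text that was inside the span
value = soup.p.get_text()    # what is left in <p> after removal

print(label, value)
```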
One of the many ways is to use contents on the parent tag (in this case, <p>).
If you know the position of the string, you can use it directly:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
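With the second approach (filtering contents for NavigableString instances), you can get all the text that is a direct child of the <p> tag. For completeness, here is another example: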
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
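The NavigableString filter above can be packaged as a small reusable function. This is only a sketch: the name direct_text is hypothetical, and two details are additions not present in the answer. Comment is a subclass of NavigableString in bs4, so comments are excluded explicitly, and whitespace-only strings are dropped before joining. "html.parser" stands in for "lxml" to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def direct_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    parts = []
    for node in tag.contents:
        # Comment subclasses NavigableString, so rule it out explicitly.
        if isinstance(node, NavigableString) and not isinstance(node, Comment):
            stripped = node.strip()
            if stripped:  # skip whitespace-only text between tags
                parts.append(stripped)
    return " ".join(parts)


html = """
<p>
I want
<span class="map-sub-title">abc</span>
foo
<span class="map-sub-title">abc2</span>
text
<span class="map-sub-title">abc3</span>
only
</p>
"""
soup = BeautifulSoup(html, "html.parser")
joined = direct_text(soup.find("p"))
print(joined)
```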
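Although every response in this thread seems acceptable, I'll point out another way to solve this problem:
soup.find("span", {'class': 'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.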
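A short runnable sketch of the next_sibling approach follows. It adds two guards that are assumptions on my part rather than part of the answer: find() returns None when no span matches, and next_sibling can be None or another tag instead of text. The function name trailing_text is hypothetical, and "html.parser" is used instead of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString


def trailing_text(html):
    """Return the text node right after the map-sub-title span, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", {"class": "map-sub-title"})
    if span is None:
        return None
    sibling = span.next_sibling
    # Only accept a text node; the sibling may be None or another tag.
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


value = trailing_text('<p><span class="map-sub-title">abc</span>123</p>')
print(value)
```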