Python: How do I scrape all the text in every p tag, including the text inside spans?
Tags: python, python-3.x, web-scraping, beautifulsoup

This is what I am left with:
# soup is the BeautifulSoup object for the page
table = soup.findAll('div', attrs={"class":"five columns"})
for data in table:
    para = data.findAll('p')
    print(para)
Location: New Delhi / Safdarjung, Current Time: Feb 12, 2017 at 10:29:52 am, Latest Report: Feb 12, 2017 at 8:30 am, Visibility: 1 km, Pressure: 102.12 kPa, Humidity: 95%, Dew Point: 10 °C
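You can try the .text attribute of the BeautifulSoup object for the paragraphs. I used re.split() to split the text further into key-value pairs; if you don't want to split it, just use para.text.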
<p><span class="four">Location: </span> <span id="wt-loc" title="New Delhi / Safdarjung">New Delhi / Safdarjung</span></p>, <p><span class="four">Current Time: </span> <span id="wtct">Feb 12, 2017 at 10:29:52 am</span></p>, <p><span class="four">Latest Report: </span> Feb 12, 2017 at 8:30 am</p>, <p><span class="four">Visibility: </span> 1 km</p>, <p><span class="four">Pressure: </span> 102.12 kPa</p>, <p><span class="four">Humidity: </span> 95%</p>, <p><span class="four">Dew Point: </span> 10 °C</p>
Use .text to get all the text under a p tag. All you need to do is iterate over the result of find_all('p').
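Beautiful Soup has a function called get_text() that returns all the text inside a tag and ignores the nested tags. Just call p.get_text(). If you also want to strip whitespace, call p.get_text(strip=True). Note that strip=True strips each text fragment before joining them, so pass separator=' ' as well if you want to keep a space between the label and the value.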
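To make the difference between these calls concrete, here is a small sketch on one paragraph from the question. It assumes the built-in 'html.parser' backend, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# One paragraph from the question's markup (title attribute omitted for brevity)
html = '<p><span class="four">Location: </span> <span id="wt-loc">New Delhi / Safdarjung</span></p>'
p = BeautifulSoup(html, 'html.parser').find('p')

# Text fragments are concatenated as-is, so the space inside the first span
# and the space between the two spans both survive
plain = p.get_text()

# Each fragment is stripped and whitespace-only fragments are dropped,
# so the label and the value end up glued together
stripped = p.get_text(strip=True)

# Stripped fragments joined with a single space
spaced = p.get_text(separator=' ', strip=True)

print(repr(plain))     # 'Location:  New Delhi / Safdarjung'
print(repr(stripped))  # 'Location:New Delhi / Safdarjung'
print(repr(spaced))    # 'Location: New Delhi / Safdarjung'
```

For this page, the separator=' ' form is probably the one that matches the desired "Label: value" text most closely.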
from bs4 import BeautifulSoup
import re
a = """<p><span class="four">Location: </span> <span id="wt-loc" title="New Delhi / Safdarjung">New Delhi / Safdarjung</span></p>, <p><span class="four">Current Time: </span> <span id="wtct">Feb 12, 2017 at 10:29:52 am</span></p>, <p><span class="four">Latest Report: </span> Feb 12, 2017 at 8:30 am</p>, <p><span class="four">Visibility: </span> 1 km</p>, <p><span class="four">Pressure: </span> 102.12 kPa</p>, <p><span class="four">Humidity: </span> 95%</p>, <p><span class="four">Dew Point: </span> 10 °C</p>"""
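soup = BeautifulSoup(a, 'html.parser')
# Split on the ", " that separates paragraphs. The lookahead requires a capital
# letter next, so the commas inside dates ("Feb 12, 2017") are left alone.
re.split(r', (?=\s*[A-Z])', soup.text)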
[u'Location: New Delhi / Safdarjung',
u'Current Time: Feb 12, 2017 at 10:29:52 am',
u'Latest Report: Feb 12, 2017 at 8:30 am',
u'Visibility: 1 km',
u'Pressure: 102.12 kPa',
u'Humidity: 95%',
u'Dew Point: 10 \xb0C']
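If you want real key-value pairs rather than a list of strings, the split result can be turned into a dictionary. This is only a sketch: the list below is a shortened, hand-written copy of the output above, and the names pairs and weather are made up for illustration.

```python
# Strings in the shape produced by the re.split() call above
# (hand-copied sample; names are illustrative, not from the original answer)
pairs = [
    'Location: New Delhi / Safdarjung',
    'Current Time: Feb 12, 2017 at 10:29:52 am',
    'Visibility: 1 km',
]

weather = {}
for item in pairs:
    # partition() splits on the first colon only, so the colons inside
    # the time value ("10:29:52") stay in the value
    key, _, value = item.partition(':')
    weather[key.strip()] = value.strip()

print(weather['Current Time'])  # Feb 12, 2017 at 10:29:52 am
```

Splitting on the first colon only is the important part here, because a plain split(':') would break the time apart.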
from bs4 import BeautifulSoup
html = '''<p><span class="four">Location: </span> <span id="wt-loc" title="New Delhi / Safdarjung">New Delhi / Safdarjung</span></p>, <p><span class="four">Current Time: </span> <span id="wtct">Feb 12, 2017 at 10:29:52 am</span></p>, <p><span class="four">Latest Report: </span> Feb 12, 2017 at 8:30 am</p>, <p><span class="four">Visibility: </span> 1 km</p>, <p><span class="four">Pressure: </span> 102.12 kPa</p>, <p><span class="four">Humidity: </span> 95%</p>, <p><span class="four">Dew Point: </span> 10 °C</p>'''
soup = BeautifulSoup(html, 'lxml')
for p in soup.find_all('p'):
    print(p.text)
Location: New Delhi / Safdarjung
Current Time: Feb 12, 2017 at 10:29:52 am
Latest Report: Feb 12, 2017 at 8:30 am
Visibility: 1 km
Pressure: 102.12 kPa
Humidity: 95%
Dew Point: 10 °C
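Putting this together with the div lookup from the question, a possible end-to-end version is sketched below. The wrapper markup is an assumption: only the class name "five columns" comes from the question's code, and the CSS selector via select() is swapped in here in place of the question's findAll() call.

```python
from bs4 import BeautifulSoup

# Assumed page fragment: the class name comes from the question's code,
# the rest of the wrapper markup is a guess
html = '''
<div class="five columns">
  <p><span class="four">Visibility: </span> 1 km</p>
  <p><span class="four">Humidity: </span> 95%</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Select every <p> inside a div carrying both classes, then flatten each
# paragraph (spans included) into one "Label: value" string
lines = [
    p.get_text(separator=' ', strip=True)
    for p in soup.select('div.five.columns p')
]

print(', '.join(lines))  # Visibility: 1 km, Humidity: 95%
```

Joining with ', ' gives a single comma-separated line; drop the join and print each item if you prefer one field per line.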