Extracting text with Python BeautifulSoup
I want to extract the bold text, which represents the latest weather PSI reading on this website. Does anyone know how to extract it using the code below? I also need the two values before the current PSI reading for a calculation: the sum of three values (the latest one and the two before it). Example: the current value (in bold) is 5AM: 51, so I also need the 3AM and 4AM values. Does anyone know how to do this and can help me? Thanks in advance.
from pprint import pprint
import urllib2  # Python 2 only
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
# The readings table is the first table inside the third descendant div of div.c1.
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

# Collect the stripped cell text of every table row.
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])

# Rows alternate: even rows hold the time labels, odd rows hold the PSI values.
data = {}
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]

pprint(data)
This prints:
{'10AM': '49',
'10PM': '-',
'11AM': '52',
'11PM': '-',
'12AM': '76',
'12PM': '54',
'1AM': '70',
'1PM': '59',
'2AM': '64',
'2PM': '65',
'3AM': '59',
'3PM': '72',
'4AM': '54',
'4PM': '79',
'5AM': '51',
'5PM': '82',
'6AM': '48',
'6PM': '79',
'7AM': '47',
'7PM': '-',
'8AM': '47',
'8PM': '-',
'9AM': '47',
'9PM': '-',
'Time': '3-hr PSI'}
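Note that pprint sorts the dict keys alphabetically, so the readings above are not in clock order, which makes it awkward to pick "the two values before 5AM". A minimal sketch of one way to restore chronological order follows. The helper names hour_of_day and sorted_readings are illustrative and not part of the original code, and the sample dict is only a subset of the output above. The sketch is written so that it also runs on Python 3.

```python
from datetime import datetime

# Subset of the printed output above, including the stray header entry.
readings = {'3AM': '59', '4AM': '54', '5AM': '51',
            '12AM': '76', '10PM': '-', 'Time': '3-hr PSI'}


def hour_of_day(label):
    """Convert a label such as '5AM' to an hour 0-23, or None if it is not a time."""
    try:
        return datetime.strptime(label, "%I%p").hour
    except ValueError:
        return None


def sorted_readings(data):
    """Return (label, value) pairs in clock order, dropping non-time keys."""
    triples = [(hour_of_day(label), label, value) for label, value in data.items()]
    triples = [t for t in triples if t[0] is not None]
    return [(label, value) for _, label, value in sorted(triples)]


for label, value in sorted_readings(readings):
    print("{0}: {1}".format(label, value))
```

With the readings in order, the entries before a given hour can be taken by position instead of by guessing key names.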
This code (see the lines marked # changed)
gives you:
[[u'Time', u'3-hr PSI'],
[u'12AM', u'57'],
[u'1AM', u'-'],
[u'2AM', u'-'],
[u'3AM', u'-'],
[u'4AM', u'-'],
[u'5AM', u'-'],
[u'6AM', u'-'],
[u'7AM', u'-'],
[u'8AM', u'-'],
[u'9AM', u'-'],
[u'10AM', u'-'],
[u'11AM', u'-'],
[u'Time', u'3-hr PSI'],
[u'12PM', u'-'],
[u'1PM', u'-'],
[u'2PM', u'-'],
[u'3PM', u'-'],
[u'4PM', u'-'],
[u'5PM', u'-'],
[u'6PM', u'-'],
[u'7PM', u'-'],
[u'8PM', u'-'],
[u'9PM', u'-'],
[u'10PM', u'-'],
[u'11PM', u'-']]
and print data[4:7]
gives you:
[[u'3AM', u'-'], [u'4AM', u'-'], [u'5AM', u'-']]
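The slice above is a list of [time, value] pairs, and any hour without a reading holds '-' instead of a number, so the values cannot be summed directly. The following is a small sketch, under the assumption that every reading is either an integer string or '-'; the helpers to_int and sum_readings are hypothetical names, and the numeric sample values are taken from the question's example (3AM, 4AM, 5AM) rather than from the slice shown above.

```python
def to_int(text):
    """Return the reading as an int, or None for placeholders such as '-'."""
    try:
        return int(text)
    except ValueError:
        return None


def sum_readings(rows):
    """Sum the numeric values in [time, value] rows; return (total, count)."""
    numbers = [to_int(value) for _, value in rows]
    numbers = [n for n in numbers if n is not None]
    return sum(numbers), len(numbers)


# Same shape as data[4:7]; values are the question's example readings.
rows = [[u'3AM', u'59'], [u'4AM', u'54'], [u'5AM', u'51']]
total, count = sum_readings(rows)
print("sum = {0} over {1} readings".format(total, count))
```

Returning the count alongside the total lets the caller decide what to do when fewer than three readings are available.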
Make sure you understand what is happening here:
import urllib2  # Python 2 only
import datetime
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

data = {}
bold_time = None  # stays None if no cell is bold
# Readings start at 12AM; each value cell advances the clock by one hour.
cur_time = datetime.datetime.strptime("12AM", "%I%p")
for tr_index, tr in enumerate(table.find_all('tr')):
    if 'Time' in tr.text:
        continue  # skip the rows that hold the time labels
    for td_index, td in enumerate(tr.find_all('td')):
        if not td_index:
            continue  # skip the leading '3-hr PSI' label cell
        data[cur_time] = td.text.strip()
        if td.find('strong'):
            bold_time = cur_time  # the bold cell is the latest reading
        cur_time += datetime.timedelta(hours=1)

# Subtracting a timedelta only works when a bold cell was actually found.
if bold_time is not None:
    print data.get(bold_time)  # bold
    print data.get(bold_time - datetime.timedelta(hours=1))  # before bold
    print data.get(bold_time - datetime.timedelta(hours=2))  # before before bold
This prints the 3-hr PSI value marked in bold and the two values before it (if they exist).
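The key idea in the answer is that the dict is keyed by datetime objects, so "one hour earlier" is a plain subtraction rather than string handling on labels like '5AM'. Here is a network-free sketch of that idea that also runs on Python 3. The list values and the index bold_index stand in for the scraped cells and the cell containing the strong tag; both are assumptions for illustration, with the numbers taken from the question's sample output.

```python
import datetime

# Stand-ins for the scraped cells, 12AM through 7AM (illustrative data).
values = ['76', '70', '64', '59', '54', '51', '-', '-']
bold_index = 5  # pretend the 5AM cell is the one wrapped in <strong>

data = {}
bold_time = None
cur_time = datetime.datetime.strptime("12AM", "%I%p")
for index, value in enumerate(values):
    data[cur_time] = value
    if index == bold_index:
        bold_time = cur_time
    cur_time += datetime.timedelta(hours=1)


def value_hours_before(hours):
    """Look up the reading `hours` before the bold one; None if unavailable."""
    if bold_time is None:
        return None
    return data.get(bold_time - datetime.timedelta(hours=hours))


latest_three = [value_hours_before(h) for h in (0, 1, 2)]
print(latest_three)
```

Because data.get returns None for a missing key, stepping back past 12AM yields None instead of raising a KeyError.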
Hope this helps.
The website has been updated; do you know how to change the code? When I tried changing it to table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[3].find_all('table')[0], it gave me an IndexError: list index out of range.
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[4].find_all('table')[0] should work. Hope this helps.
I need help with this code again. When the website refreshes at 12, all the data is cleared. That produces an error at bold = data[bold_time]: NameError: name 'bold_time' is not defined. How can we prevent this?
Sure, I have updated the code: I defined bold_time before the loop and used data.get(bold_time).
I add up the 3 bold times to compute the average, so that is where the error comes up. Can I prevent it? For example, by only using strong + strong + strong / 3 for the calculation? How do I add an exception?
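One possible way to handle the last question is to treat a missing bold cell and missing readings as normal cases rather than errors, and to average only the values that are actually present. This is a sketch under the assumption that data is keyed by datetime objects as in the answer's code and that bold_time is None when no cell is bold; the function name average_of_latest is hypothetical.

```python
import datetime


def average_of_latest(data, bold_time, count=3):
    """Average the bold reading and the (count - 1) readings before it.

    Returns None when there is no bold reading or no numeric value at all.
    """
    if bold_time is None:
        return None
    numbers = []
    for hours_back in range(count):
        text = data.get(bold_time - datetime.timedelta(hours=hours_back))
        try:
            numbers.append(float(text))
        except (TypeError, ValueError):
            # TypeError: the hour is missing (None); ValueError: placeholder '-'.
            continue
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


midnight = datetime.datetime.strptime("12AM", "%I%p")
hour = datetime.timedelta(hours=1)

# Just after the midnight reset only the 12AM reading exists.
after_reset = {midnight: '57', midnight + hour: '-'}
print(average_of_latest(after_reset, midnight))

# No bold cell found at all.
print(average_of_latest({}, None))
```

Averaging over the available values is a design choice; if all three readings are required, the function could instead return None whenever fewer than count numbers were collected.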