Read each row of an HTML table into a Python list
I am trying to scrape HTML tables from the web with Python, using Beautiful Soup for the scraping. The HTML page contains many tables, and each table has many rows. I want every row to get a distinct name, and if a row has columns, I want them kept separate. My code looks like this:
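from collections import defaultdict

from bs4 import BeautifulSoup
from requests import get  # assumed import; the comments below guess the same

page = get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')
result = defaultdict(list)
tables = html.find_all('table')
for table in tables:
    for row in table.find_all('tr')[0:15]:
        try:
            pass  # stuck here
        except ValueError:
            continue  # blank/empty row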
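I need some guidance.
I'd suggest ditching BeautifulSoup (beautiful though it is) and using pandas (which uses BeautifulSoup or lxml on the back end). What you describe is bog-standard pandas; please read the documentation.
If I have understood your requirement correctly, the following script should do it: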
import requests
from bs4 import BeautifulSoup

url = 'https://www.4dpredict.com/mysingaporetoto.p3.html'
res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')

num = 0
for row in soup.select("table tr"):
    num += 1
    # Prefix each row with its running number, then the text of every cell.
    data = [f'{num}'] + [item.get_text(strip=True) for item in row.select("td")]
    print(data)
Partial output:
['1', 'SINGAPORE TOTO2018-08-23 (Thu) 3399']
['2', 'WINNING NUMBERS']
['3', '02', '03', '23', '30', '39', '41']
['4', 'ADDITIONAL']
['5', '19']
['6', 'Prize:$2,499,788']
['7', 'WINNING SHARES']
['8', 'Group', 'Share Amt', 'Winners']
['9', 'Group 1', '$1,249,894', '2']
['10', 'Group 2', '$', '-']
['11', 'Group 3', '$1,614', '124']
['12', 'Group 4', '$344', '318']
['13', 'Group 5', '$50', '6,876']
['14', 'Group 6', '$25', '9,092']
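The same cell extraction can be fitted back into the asker's defaultdict(list) skeleton so that each row gets its own name. This is only a minimal sketch: the inline SAMPLE_HTML, the rows_by_name helper, and the "table_N_row_M" naming scheme are all hypothetical stand-ins, not part of the original question or answer.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the live page.
SAMPLE_HTML = """
<table>
  <tr><td>WINNING NUMBERS</td></tr>
  <tr><td>02</td><td>03</td><td>23</td></tr>
  <tr><td></td></tr>
</table>
<table>
  <tr><td>Group</td><td>Share Amt</td><td>Winners</td></tr>
  <tr><td>Group 1</td><td>$1,249,894</td><td>2</td></tr>
</table>
"""


def rows_by_name(html_text):
    """Map a generated name such as 'table_1_row_2' to that row's cell texts."""
    soup = BeautifulSoup(html_text, "html.parser")
    result = defaultdict(list)
    for t_idx, table in enumerate(soup.find_all("table"), start=1):
        for r_idx, row in enumerate(table.find_all("tr"), start=1):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            cells = [c for c in cells if c]  # drop empty cells
            if not cells:
                continue  # blank/empty row: no entry is created
            result[f"table_{t_idx}_row_{r_idx}"].extend(cells)
    return result


rows = rows_by_name(SAMPLE_HTML)
for name, cells in rows.items():
    print(name, cells)
```

Testing for an empty cell list replaces the try/except ValueError in the question, since nothing in this loop raises ValueError for a blank row.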
I suggest calling requests.get() rather than a bare get(). Please check the code below and let me know if it doesn't work.
import requests
from bs4 import BeautifulSoup
import pprint

page = requests.get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')
tables = html.find_all('table')
table_data = dict()

for table_id, table in enumerate(tables):
    print('[!] Scraping Table -', table_id + 1)
    table_data['table_{}'.format(table_id+1)] = dict()
    table_info = table_data['table_{}'.format(table_id+1)]
    for row_id, row in enumerate(table.find_all('tr')):
        col = []
        for val in row.find_all('td'):
            val = val.text
            val = val.replace('\n', '').strip()
            if val:
                col.append(val)
        table_info['row_{}'.format(row_id+1)] = col
    pprint.pprint(table_info)
    print('+-+' * 20)

pprint.pprint(table_data)
Sample output:
[!] Scraping Table - 1
{'row_1': ['SINGAPORE TOTO2018-08-23 (Thu) 3399'],
'row_10': ['Group 2', '$', '-'],
'row_11': ['Group 3', '$1,614', '124'],
'row_12': ['Group 4', '$344', '318'],
'row_13': ['Group 5', '$50', '6,876'],
'row_14': ['Group 6', '$25', '9,092'],
'row_15': ['Group 7', '$10', '117,080'],
'row_16': ['SHOW ANALYSISEVEN : ODD, 2 : 5SUM :138, AVERAGE :23 MIN :02, MAX '
':41, DIFF :39',
'EVEN : ODD, 2 : 5',
'SUM :138, AVERAGE :23',
'MIN :02, MAX :41, DIFF :39'],
'row_17': ['EVEN : ODD, 2 : 5'],
'row_18': ['SUM :138, AVERAGE :23'],
'row_19': ['MIN :02, MAX :41, DIFF :39'],
'row_2': ['WINNING NUMBERS'],
'row_3': ['02', '03', '23', '30', '39', '41'],
'row_4': ['ADDITIONAL'],
'row_5': ['19'],
'row_6': ['Prize: $2,499,788'],
'row_7': ['WINNING SHARES'],
'row_8': ['Group', 'Share Amt', 'Winners'],
'row_9': ['Group 1', '$1,249,894', '2']}
+-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-+
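For comparison, the pandas route suggested in the comments might look like the sketch below. It assumes pandas is installed together with one of its HTML parser back ends (lxml, or BeautifulSoup with html5lib), and it uses pandas.read_html on a hypothetical inline SAMPLE_HTML rather than the live page, so how it behaves on the real site's markup is untested here.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline sample standing in for the live page.
SAMPLE_HTML = """
<table>
  <tr><th>Group</th><th>Share Amt</th><th>Winners</th></tr>
  <tr><td>Group 1</td><td>$1,249,894</td><td>2</td></tr>
  <tr><td>Group 2</td><td>$</td><td>-</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))

for table_id, df in enumerate(tables, start=1):
    print("table_{}".format(table_id), list(df.columns))
    # Each data row as a plain Python list, one element per column.
    for row in df.values.tolist():
        print(row)
```

Wrapping the markup in StringIO avoids the deprecation warning that newer pandas versions emit when read_html is given a literal HTML string.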
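Could you expand your answer with some explanation of the code, to make clear which lines answer the OP's question?
The OP seems to be using the requests library. However, he probably imported get from it, as in: from requests import get.
I still can't find any connection between the answer to this question and your one-line comment.
Thanks for the suggestion, SIM. I'm new to both Python and Stack Overflow, and I'm trying to learn and work things out.
If you are using a recent version of Python, I doubt there will be any error. Please check the output the script produces.
How do I fix this?
Try replacing [f'{num}'] with ['%s' % num].
Then I'm unable to understand your requirement. Thanks.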
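The replacement suggested in the last comments only matters on interpreters older than Python 3.6, where an f-string is a syntax error. A small sketch with made-up values for num and cells, showing that the spellings build the same row:

```python
num = 3
cells = ["02", "03", "23"]  # made-up cell texts for illustration

with_fstring = [f"{num}"] + cells    # requires Python 3.6 or newer
with_percent = ["%s" % num] + cells  # also works on older interpreters
with_str = [str(num)] + cells        # plain conversion, no formatting syntax

print(with_fstring)
print(with_percent == with_fstring == with_str)
```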