Python 3.x: Extracting data from an HTML pivot table
I have a few problems to solve:
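1. Columns 1-3 may be merged cells, so their "td" elements are missing on some rows. If no td exists for these columns, how do I fill them in for every row? Columns 4-7, on the other hand, are always present. One approach I have considered is to loop backwards to read columns 7, 6, 5, and 4, and, if columns 3, 2, and 1 are missing, reuse the values from the previous row.
2. Column 4 may contain multiple values and one or more hyperlinks. I need to extract the text and all of the hyperlinks, click them, and download the attachments.

If there is a better approach than Selenium, please let me know. The final output is an Excel file populated with this table data and the attachments.

Python-Selenium code: this code extracts each row's data column by column, using the HTML tag ('td'). I have attached the HTML extracted with BeautifulSoup to give you an idea of what the table looks like. The first table row has 7 "td" elements, but the subsequent rows do not.

Hmm. So I made the following improvement to the col code. Reversing the loop works great! Now all of my hyperlinks will be in column 4. I still have to work out how to fill columns 1-3 for every row, open all of the hyperlinks, and save them to a specific folder on a shared drive. Thanks.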
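from selenium.webdriver.common.by import By

# `rows` is a single <tr> element of the table
columncounter = 7
cols = rows.find_elements(By.TAG_NAME, "td")
for col in reversed(cols):  # walk from column 7 back towards column 1
    print('ColumnNumber = %d' % columncounter)
    print(col.text)
    if columncounter == 4:
        colfour = col.get_attribute('innerHTML')  # full HTML of the cell
        colfour2 = col.find_elements(By.TAG_NAME, 'a')
        for a in colfour2:
            print(a.get_attribute('href'))  # every hyperlink in column 4
    columncounter -= 1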
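
The remaining step, filling columns 1-3 on rows where merged cells leave out the "td" elements, can be sketched without Selenium at all. The snippet below is only an illustration under stated assumptions: each row has already been reduced to a list of cell texts, the missing cells are always the leading ones, and the names `fill_merged_columns` and `sample` (including its values) are hypothetical, not taken from the original page.

```python
def fill_merged_columns(rows, width=7):
    """Right-align short rows and fill the missing leading columns
    with the values from the previous filled row."""
    filled = []
    previous = [""] * width  # nothing to inherit before the first row
    for cells in rows:
        missing = width - len(cells)
        if missing < 0:
            raise ValueError("row has more cells than expected: %r" % (cells,))
        # Leading columns come from the previous row, the rest from this row.
        current = previous[:missing] + list(cells)
        filled.append(current)
        previous = current
    return filled


# Made-up sample data: the first row is complete, the later ones are short.
sample = [
    ["North", "2019", "Q1", "report.pdf", "10", "20", "30"],  # all 7 cells
    ["notes.pdf", "11", "21", "31"],                          # columns 1-3 merged
    ["Q2", "summary.pdf", "12", "22", "32"],                  # columns 1-2 merged
]
for row in fill_merged_columns(sample):
    print(row)
```

With Selenium, the input lists could be built from the text of each row's "td" elements, after dropping empty rows. This is the same idea as reading the columns in reverse: whatever is missing on the left is taken from the row above.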
This may not be the best solution, but here is what I did to get this working:
from selenium import webdriver
from selenium.webdriver.common.by import By
import win32com.client as win32

xl = win32.gencache.EnsureDispatch('Excel.Application')
xl.Visible = 1
wb = xl.Workbooks.Open('Template.xlsx')
xl.DisplayAlerts = False
ws = wb.Worksheets('Sheet1')
tr = 10  # paste results starting at row 10 of the Excel table

# `row` is assumed to hold the table's <tr> elements, located earlier with Selenium.
# Columns 1-3 keep their previous value when a row has no "td" for them.
col1 = col2 = col3 = ''
for rows in row[1:]:  # skip header row
    rtxt = rows.text.strip()  # trim text to determine if row is empty
    if not rtxt:
        continue  # skip empty rows
    columncounter = 7
    links = []  # https hyperlinks found in column 4 of this row
    cols = rows.find_elements(By.TAG_NAME, "td")
    for col in reversed(cols):
        if columncounter == 7:
            col7 = col.text
        elif columncounter == 6:
            col6 = col.text
        elif columncounter == 5:
            col5 = col.text
        elif columncounter == 4:
            col4 = col.text
            for a in col.find_elements(By.TAG_NAME, 'a'):
                link = a.get_attribute('href')  # extract hyperlink
                if link and link.startswith('https'):  # only want https hyperlinks
                    links.append(link)
        elif columncounter == 3:
            col3 = col.text
        elif columncounter == 2:
            col2 = col.text
        elif columncounter == 1:
            col1 = col.text
        columncounter -= 1
    col8 = ', '.join(links)  # multiple links go into the same cell
    # paste all of the values into columns A-H of the current Excel row
    values = (col1, col2, col3, col4, col5, col6, col7, col8)
    for letter, value in zip('ABCDEFGH', values):
        ws.Range(letter + str(tr)).Value = value
    tr += 1
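This carries the values of columns 1-3 forward, because the merged cells in the first three columns mean that not every row has a "td" for all 7 columns. Each value is pasted into the Excel file as a repeated value until a new one is retrieved from that column's col.text, according to columncounter.

Please read why. Consider updating the question with the relevant HTML as formatted text, your code trials, and the error stack trace.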
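
The question also asks for the attachments behind the column 4 hyperlinks to be saved to a folder, which the code above does not do. A minimal sketch of that step follows, using only the standard library. It assumes the links can be fetched with a plain, unauthenticated request; if the site requires the browser's login session, this will not work as written. The helper names `https_links`, `local_path`, and `download_all`, and the example URL, are hypothetical.

```python
import os
import urllib.request
from urllib.parse import unquote, urlparse


def https_links(hrefs):
    """Keep only the links that start with https, dropping empty hrefs."""
    return [h for h in hrefs if h and h.startswith("https")]


def local_path(url, folder):
    """Build the destination path from the last segment of the URL path."""
    name = os.path.basename(unquote(urlparse(url).path)) or "attachment"
    return os.path.join(folder, name)


def download_all(hrefs, folder):
    """Download every https link into folder and return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in https_links(hrefs):
        target = local_path(url, folder)
        urllib.request.urlretrieve(url, target)  # plain, unauthenticated request
        saved.append(target)
    return saved


# Example hrefs as get_attribute('href') might return them (made up).
links = ["https://example.com/files/report%202019.pdf", None, "javascript:void(0)"]
print("; ".join(https_links(links)))
```

Calling `download_all(links, folder)` with the hrefs collected from column 4 would then save each file under `folder`; the returned paths could be written to the Excel row in place of the raw links.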