使用Excel数据搜索网站以提取结果,然后循环
我在Excel电子表格中有8000个值 我需要搜索一个网站,然后在Excel电子表格中记录从网站到的特定数据行。使用Excel数据搜索网站以提取结果,然后循环,excel,vba,web,web-scraping,Excel,Vba,Web,Web Scraping,我在Excel电子表格中有8000个值 我需要搜索一个网站,然后在Excel电子表格中记录从网站到的特定数据行。 我找到了搜索数据的代码 它像这样从网站收集数据。 我只需要“实体类型”数据行 我找不到如何扩展代码,只抓取这一行并输入到相应的单元格。i、 e.ABN(b2)搜索,找到输入的“实体类型”并粘贴到公司类型(c2)中 或者,我试图找到如何垂直而不是水平地填充信息。我可以删除不需要的列。我想这可能更简单 我试着和开发者一起录制宏 我还需要循环到下一个ABN并填充相应的字段,以此类推(B
我找到了搜索数据的代码 它像这样从网站收集数据。
我只需要“实体类型”数据行 我找不到如何扩展代码,只抓取这一行并输入到相应的单元格。i、 e.ABN(b2)搜索,找到输入的“实体类型”并粘贴到公司类型(c2)中 或者,我试图找到如何垂直而不是水平地填充信息。我可以删除不需要的列。我想这可能更简单 我试着和开发者一起录制宏
我还需要循环到下一个ABN并填充相应的字段,以此类推(B3>C3、B4>C4等)。这是绝对可能的。我经常发现,最难的部分是从另一个平台获取信息。为了实现这一点,我将把它分开一点,为了简单起见,使用两张表(表1包含已知数据,表2包含web数据) 循环浏览约8000家企业的列表。我们可以通过UsedRange行数来确定这一点。我们知道ABN在第2列(也称为B),所以我们将其复制到变量中以传递给函数。函数将把“实体类型:”返回到同一行的第3(C)列
Sub LoopThroughBusinesses()
Dim i As Integer
Dim ABN As String
For i = 2 To Sheet1.UsedRange.Rows.Count
ABN = Sheet1.Cells(i, 2)
Sheet1.Cells(i, 3) = URL_Get_ABN_Query(ABN)
Next i
End Sub
将您创建的子例程更改为函数,使其返回您要查找的实体类型。该函数将数据保存到Sheet2中,然后只返回我们要查找的实体数据
Function URL_Get_ABN_Query(strSearch As String) As String ' Change it from a Sub to a Function that returns the desired string
' strSearch = Range("a1") ' This is now passed as a parameter into the Function
Dim entityRange As Range
With Sheet2.QueryTables.Add( _
Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
Destination:=Sheet2.Range("A1")) ' Change this destination to Sheet2
.BackgroundQuery = True
.TablesOnlyFromHTML = True
.Refresh BackgroundQuery:=False
.SaveData = True
End With
' Find the Range that has "Entity Type:"
Set entityRange = Sheet2.UsedRange.Find("Entity type:")
' Then return the value of the cell to its' right
URL_Get_ABN_Query = entityRange.Offset(0, 1).Value2
' Clear Sheet2 for the next run
Sheet2.UsedRange.Delete
End Function
这是绝对可能的。我经常发现,最难的部分是从另一个平台获取信息。为了实现这一点,我将把它分开一点,为了简单起见,使用两张表(表1包含已知数据,表2包含web数据) 循环浏览约8000家企业的列表。我们可以通过UsedRange行数来确定这一点。我们知道ABN在第2列(也称为B),所以我们将其复制到变量中以传递给函数。函数将把“实体类型:”返回到同一行的第3(C)列
Sub LoopThroughBusinesses()
Dim i As Integer
Dim ABN As String
For i = 2 To Sheet1.UsedRange.Rows.Count
ABN = Sheet1.Cells(i, 2)
Sheet1.Cells(i, 3) = URL_Get_ABN_Query(ABN)
Next i
End Sub
将您创建的子例程更改为函数,使其返回您要查找的实体类型。该函数将数据保存到Sheet2中,然后只返回我们要查找的实体数据
Function URL_Get_ABN_Query(strSearch As String) As String ' Change it from a Sub to a Function that returns the desired string
' strSearch = Range("a1") ' This is now passed as a parameter into the Function
Dim entityRange As Range
With Sheet2.QueryTables.Add( _
Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
Destination:=Sheet2.Range("A1")) ' Change this destination to Sheet2
.BackgroundQuery = True
.TablesOnlyFromHTML = True
.Refresh BackgroundQuery:=False
.SaveData = True
End With
' Find the Range that has "Entity Type:"
Set entityRange = Sheet2.UsedRange.Find("Entity type:")
' Then return the value of the cell to its' right
URL_Get_ABN_Query = entityRange.Offset(0, 1).Value2
' Clear Sheet2 for the next run
Sheet2.UsedRange.Delete
End Function
您不希望以这种方式设置大量连接(查询表)。如果可能的话,它将是如此缓慢。在8000个请求时,如果xmlhttp没有被阻止或限制,下面的方法将大大加快速度。如果确实出现减速/阻塞,则每隔x个请求添加一个小的等待 如果可能,使用xmlhttp收集数据。用于专门针对实体类型。将值存储在数组中,并在末尾使用循环进行写出。使用类来保存xmlhttp对象以提高效率。为类提供方法,包括如何处理not found(给出的示例)。添加一些进一步的优化,例如给定的是关闭屏幕更新。这假设您的搜索号码位于B2的B列中。下面的代码还对B列中是否存在某些内容进行一些基本检查,并处理存在1个或多个数字的情况 好的代码是模块化的,您需要一个函数来返回某些内容,并需要一个子函数来执行操作。单个子功能不应完成大量任务。您希望使用遵循(或接近)原则的代码轻松调试 类别clsHTTP
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetHTML(ByVal URL As String) As String
Dim sResponse As String
With http
.Open "GET", URL, False
.send
GetHTML = StrConv(.responseBody, vbUnicode)
End With
End Function
Public Function GetEntityType(ByVal html As HTMLDocument) As String
On Error GoTo errhand:
GetEntityType = html.querySelector("a[href*='EntityTypeDescription']").innerText
Exit Function
errhand:
GetEntityType = "Not Found"
End Function
标准模块:
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
参考资料(VBE>工具>参考资料):
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
CSS选择器:
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
我使用实体描述是一个超链接(a
tag)并且其值包含字符串EntityTypeDescription
的事实来使用css属性=值和contains(*)操作符作为目标
您不希望以这种方式设置大量连接(查询表)。如果可能的话,它将是如此缓慢。在8000个请求时,如果xmlhttp没有被阻止或限制,下面的方法将大大加快速度。如果确实出现减速/阻塞,则每隔x个请求添加一个小的等待 如果可能,使用xmlhttp收集数据。用于专门针对实体类型。将值存储在数组中,并在末尾使用循环进行写出。使用类来保存xmlhttp对象以提高效率。为类提供方法,包括如何处理not found(给出的示例)。添加一些进一步的优化,例如给定的是关闭屏幕更新。这假设您的搜索号码位于B2的B列中。下面的代码还对B列中是否存在某些内容进行一些基本检查,并处理存在1个或多个数字的情况 好的代码是模块化的,您需要一个函数来返回某些内容,并需要一个子函数来执行操作。单个子功能不应完成大量任务。您希望使用遵循(或接近)原则的代码轻松调试 类别clsHTTP
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetHTML(ByVal URL As String) As String
Dim sResponse As String
With http
.Open "GET", URL, False
.send
GetHTML = StrConv(.responseBody, vbUnicode)
End With
End Function
Public Function GetEntityType(ByVal html As HTMLDocument) As String
On Error GoTo errhand:
GetEntityType = html.querySelector("a[href*='EntityTypeDescription']").innerText
Exit Function
errhand:
GetEntityType = "Not Found"
End Function
标准模块:
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
参考资料(VBE>工具>参考资料):
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
CSS选择器:
Option Explicit
Public Sub GetInfo()
Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
Set html = New HTMLDocument
Set http = New clsHTTP
Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
Select Case lastRow
Case 1
Exit Sub
Case 2
ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
Case Else
arr = .Range("B2:B" & lastRow).Value
End Select
ReDim groupResults(1 To lastRow - 1)
With http
For i = LBound(arr, 1) To UBound(arr, 1)
If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
sResponse = .GetHTML(BASE_URL & arr(i, 1))
html.body.innerHTML = sResponse
groupResults(i) = .GetEntityType(html)
sResponse = vbNullString: html.body.innerHTML = vbNullString
End If
Next
End With
For i = LBound(groupResults) To UBound(groupResults)
.Cells(i + 1, "C") = groupResults(i)
Next
End With
Application.ScreenUpdating = True
End Sub
我使用实体描述是一个超链接(a
tag)并且其值包含字符串EntityTypeDescription
的事实来使用css属性=值和contains(*)操作符作为目标
请注意,这需要输入到模块中,而不是编码到图纸对象中。此外,如果代码不适用于Sheet1和Sheet2,则您可以将其更改为工作表(“名称”),其中名称是您在工作簿底部看到的各个选项卡的名称。谢谢!我刚和一个朋友聊天,他建议放弃搜索