Warning: file_get_contents(/data/phpspider/zhask/data//catemap/5/excel/28.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
使用Excel数据搜索网站以提取结果,然后循环_Excel_Vba_Web_Web Scraping - Fatal编程技术网

使用Excel数据搜索网站以提取结果,然后循环

使用Excel数据搜索网站以提取结果,然后循环,excel,vba,web,web-scraping,Excel,Vba,Web,Web Scraping,我在Excel电子表格中有8000个值 我需要搜索一个网站,然后在Excel电子表格中记录从网站到的特定数据行。 我找到了搜索数据的代码 它像这样从网站收集数据。 我只需要“实体类型”数据行 我找不到如何扩展代码,只抓取这一行并输入到相应的单元格。i、 e.ABN(b2)搜索,找到输入的“实体类型”并粘贴到公司类型(c2)中 或者,我试图找到如何垂直而不是水平地填充信息。我可以删除不需要的列。我想这可能更简单 我试着和开发者一起录制宏 我还需要循环到下一个ABN并填充相应的字段,以此类推(B

我在Excel电子表格中有8000个值

我需要搜索一个网站,然后在Excel电子表格中记录从网站到的特定数据行。

我找到了搜索数据的代码

它像这样从网站收集数据。

我只需要“实体类型”数据行

我找不到如何扩展代码,只抓取这一行并输入到相应的单元格。i、 e.ABN(b2)搜索,找到输入的“实体类型”并粘贴到公司类型(c2)中

或者,我试图找到如何垂直而不是水平地填充信息。我可以删除不需要的列。我想这可能更简单

我试着和开发者一起录制宏


我还需要循环到下一个ABN并填充相应的字段,以此类推(B3>C3、B4>C4等)。

这是绝对可能的。我经常发现,最难的部分是从另一个平台获取信息。为了实现这一点,我将把它分开一点,为了简单起见,使用两张表(表1包含已知数据,表2包含web数据)

循环浏览约8000家企业的列表。我们可以通过UsedRange行数来确定这一点。我们知道ABN在第2列(也称为B),所以我们将其复制到变量中以传递给函数。函数将把“实体类型:”返回到同一行的第3(C)列

Sub LoopThroughBusinesses() 
    Dim i As Integer
    Dim ABN As String
    For i = 2 To Sheet1.UsedRange.Rows.Count
        ABN = Sheet1.Cells(i, 2)
        Sheet1.Cells(i, 3) = URL_Get_ABN_Query(ABN)
    Next i
End Sub
将您创建的子例程更改为函数,使其返回您要查找的实体类型。该函数将数据保存到Sheet2中,然后只返回我们要查找的实体数据

Function URL_Get_ABN_Query(strSearch As String) As String   ' Change it from a Sub to a Function that returns the desired string
    ' strSearch = Range("a1") ' This is now passed as a parameter into the Function
    Dim entityRange As Range
    With Sheet2.QueryTables.Add( _
            Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
            Destination:=Sheet2.Range("A1"))   ' Change this destination to Sheet2

        .BackgroundQuery = True
        .TablesOnlyFromHTML = True
        .Refresh BackgroundQuery:=False
        .SaveData = True
    End With

    ' Find the Range that has "Entity Type:"
    Set entityRange = Sheet2.UsedRange.Find("Entity type:")

    ' Then return the value of the cell to its' right
    URL_Get_ABN_Query = entityRange.Offset(0, 1).Value2

    ' Clear Sheet2 for the next run
    Sheet2.UsedRange.Delete

End Function

这是绝对可能的。我经常发现,最难的部分是从另一个平台获取信息。为了实现这一点,我将把它分开一点,为了简单起见,使用两张表(表1包含已知数据,表2包含web数据)

循环浏览约8000家企业的列表。我们可以通过UsedRange行数来确定这一点。我们知道ABN在第2列(也称为B),所以我们将其复制到变量中以传递给函数。函数将把“实体类型:”返回到同一行的第3(C)列

Sub LoopThroughBusinesses() 
    Dim i As Integer
    Dim ABN As String
    For i = 2 To Sheet1.UsedRange.Rows.Count
        ABN = Sheet1.Cells(i, 2)
        Sheet1.Cells(i, 3) = URL_Get_ABN_Query(ABN)
    Next i
End Sub
将您创建的子例程更改为函数,使其返回您要查找的实体类型。该函数将数据保存到Sheet2中,然后只返回我们要查找的实体数据

Function URL_Get_ABN_Query(strSearch As String) As String   ' Change it from a Sub to a Function that returns the desired string
    ' strSearch = Range("a1") ' This is now passed as a parameter into the Function
    Dim entityRange As Range
    With Sheet2.QueryTables.Add( _
            Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
            Destination:=Sheet2.Range("A1"))   ' Change this destination to Sheet2

        .BackgroundQuery = True
        .TablesOnlyFromHTML = True
        .Refresh BackgroundQuery:=False
        .SaveData = True
    End With

    ' Find the Range that has "Entity Type:"
    Set entityRange = Sheet2.UsedRange.Find("Entity type:")

    ' Then return the value of the cell to its' right
    URL_Get_ABN_Query = entityRange.Offset(0, 1).Value2

    ' Clear Sheet2 for the next run
    Sheet2.UsedRange.Delete

End Function

您不希望以这种方式设置大量连接(查询表)。如果可能的话,它将是如此缓慢。在8000个请求时,如果xmlhttp没有被阻止或限制,下面的方法将大大加快速度。如果确实出现减速/阻塞,则每隔x个请求添加一个小的等待

如果可能,使用xmlhttp收集数据。用于专门针对实体类型。将值存储在数组中,并在末尾使用循环进行写出。使用类来保存xmlhttp对象以提高效率。为类提供方法,包括如何处理not found(给出的示例)。添加一些进一步的优化,例如给定的是关闭屏幕更新。这假设您的搜索号码位于B2的B列中。下面的代码还对B列中是否存在某些内容进行一些基本检查,并处理存在1个或多个数字的情况

好的代码是模块化的,您需要一个函数来返回某些内容,并需要一个子函数来执行操作。单个子功能不应完成大量任务。您希望使用遵循(或接近)原则的代码轻松调试

类别clsHTTP

Option Explicit

Private http As Object  
Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetHTML(ByVal URL As String) As String
    Dim sResponse As String
    With http
        .Open "GET", URL, False
        .send
        GetHTML = StrConv(.responseBody, vbUnicode)
    End With
End Function

Public Function GetEntityType(ByVal html As HTMLDocument) As String
    On Error GoTo errhand:
     GetEntityType = html.querySelector("a[href*='EntityTypeDescription']").innerText
    Exit Function
errhand:
    GetEntityType = "Not Found"
End Function
标准模块:

Option Explicit 
Public Sub GetInfo()
    Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    Set http = New clsHTTP
    Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
    Application.ScreenUpdating = False

    With ThisWorkbook.Worksheets("Sheet1")
        lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
        Select Case lastRow
        Case 1
            Exit Sub
        Case 2
            ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
        Case Else
            arr = .Range("B2:B" & lastRow).Value
        End Select

        ReDim groupResults(1 To lastRow - 1)

        With http
            For i = LBound(arr, 1) To UBound(arr, 1)
                If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                    sResponse = .GetHTML(BASE_URL & arr(i, 1))
                    html.body.innerHTML = sResponse
                    groupResults(i) = .GetEntityType(html)
                    sResponse = vbNullString: html.body.innerHTML = vbNullString
                End If
            Next
        End With
        For i = LBound(groupResults) To UBound(groupResults)
            .Cells(i + 1, "C") = groupResults(i)
        Next
    End With
    Application.ScreenUpdating = True
End Sub

参考资料(VBE>工具>参考资料):

Option Explicit 
Public Sub GetInfo()
    Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    Set http = New clsHTTP
    Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
    Application.ScreenUpdating = False

    With ThisWorkbook.Worksheets("Sheet1")
        lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
        Select Case lastRow
        Case 1
            Exit Sub
        Case 2
            ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
        Case Else
            arr = .Range("B2:B" & lastRow).Value
        End Select

        ReDim groupResults(1 To lastRow - 1)

        With http
            For i = LBound(arr, 1) To UBound(arr, 1)
                If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                    sResponse = .GetHTML(BASE_URL & arr(i, 1))
                    html.body.innerHTML = sResponse
                    groupResults(i) = .GetEntityType(html)
                    sResponse = vbNullString: html.body.innerHTML = vbNullString
                End If
            Next
        End With
        For i = LBound(groupResults) To UBound(groupResults)
            .Cells(i + 1, "C") = groupResults(i)
        Next
    End With
    Application.ScreenUpdating = True
End Sub
  • Microsoft HTML对象库

  • CSS选择器:

    Option Explicit 
    Public Sub GetInfo()
        Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
        Set html = New HTMLDocument
        Set http = New clsHTTP
        Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
        Application.ScreenUpdating = False
    
        With ThisWorkbook.Worksheets("Sheet1")
            lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
            Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
            Case Else
                arr = .Range("B2:B" & lastRow).Value
            End Select
    
            ReDim groupResults(1 To lastRow - 1)
    
            With http
                For i = LBound(arr, 1) To UBound(arr, 1)
                    If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                        sResponse = .GetHTML(BASE_URL & arr(i, 1))
                        html.body.innerHTML = sResponse
                        groupResults(i) = .GetEntityType(html)
                        sResponse = vbNullString: html.body.innerHTML = vbNullString
                    End If
                Next
            End With
            For i = LBound(groupResults) To UBound(groupResults)
                .Cells(i + 1, "C") = groupResults(i)
            Next
        End With
        Application.ScreenUpdating = True
    End Sub
    
    我使用实体描述是一个超链接(
    a
    tag)并且其值包含字符串
    EntityTypeDescription
    的事实来使用css属性=值和contains(*)操作符作为目标


    您不希望以这种方式设置大量连接(查询表)。如果可能的话,它将是如此缓慢。在8000个请求时,如果xmlhttp没有被阻止或限制,下面的方法将大大加快速度。如果确实出现减速/阻塞,则每隔x个请求添加一个小的等待

    如果可能,使用xmlhttp收集数据。用于专门针对实体类型。将值存储在数组中,并在末尾使用循环进行写出。使用类来保存xmlhttp对象以提高效率。为类提供方法,包括如何处理not found(给出的示例)。添加一些进一步的优化,例如给定的是关闭屏幕更新。这假设您的搜索号码位于B2的B列中。下面的代码还对B列中是否存在某些内容进行一些基本检查,并处理存在1个或多个数字的情况

    好的代码是模块化的,您需要一个函数来返回某些内容,并需要一个子函数来执行操作。单个子功能不应完成大量任务。您希望使用遵循(或接近)原则的代码轻松调试

    类别clsHTTP

    Option Explicit
    
    Private http As Object  
    Private Sub Class_Initialize()
        Set http = CreateObject("MSXML2.XMLHTTP")
    End Sub
    
    Public Function GetHTML(ByVal URL As String) As String
        Dim sResponse As String
        With http
            .Open "GET", URL, False
            .send
            GetHTML = StrConv(.responseBody, vbUnicode)
        End With
    End Function
    
    Public Function GetEntityType(ByVal html As HTMLDocument) As String
        On Error GoTo errhand:
         GetEntityType = html.querySelector("a[href*='EntityTypeDescription']").innerText
        Exit Function
    errhand:
        GetEntityType = "Not Found"
    End Function
    
    标准模块:

    Option Explicit 
    Public Sub GetInfo()
        Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
        Set html = New HTMLDocument
        Set http = New clsHTTP
        Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
        Application.ScreenUpdating = False
    
        With ThisWorkbook.Worksheets("Sheet1")
            lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
            Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
            Case Else
                arr = .Range("B2:B" & lastRow).Value
            End Select
    
            ReDim groupResults(1 To lastRow - 1)
    
            With http
                For i = LBound(arr, 1) To UBound(arr, 1)
                    If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                        sResponse = .GetHTML(BASE_URL & arr(i, 1))
                        html.body.innerHTML = sResponse
                        groupResults(i) = .GetEntityType(html)
                        sResponse = vbNullString: html.body.innerHTML = vbNullString
                    End If
                Next
            End With
            For i = LBound(groupResults) To UBound(groupResults)
                .Cells(i + 1, "C") = groupResults(i)
            Next
        End With
        Application.ScreenUpdating = True
    End Sub
    

    参考资料(VBE>工具>参考资料):

    Option Explicit 
    Public Sub GetInfo()
        Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
        Set html = New HTMLDocument
        Set http = New clsHTTP
        Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
        Application.ScreenUpdating = False
    
        With ThisWorkbook.Worksheets("Sheet1")
            lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
            Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
            Case Else
                arr = .Range("B2:B" & lastRow).Value
            End Select
    
            ReDim groupResults(1 To lastRow - 1)
    
            With http
                For i = LBound(arr, 1) To UBound(arr, 1)
                    If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                        sResponse = .GetHTML(BASE_URL & arr(i, 1))
                        html.body.innerHTML = sResponse
                        groupResults(i) = .GetEntityType(html)
                        sResponse = vbNullString: html.body.innerHTML = vbNullString
                    End If
                Next
            End With
            For i = LBound(groupResults) To UBound(groupResults)
                .Cells(i + 1, "C") = groupResults(i)
            Next
        End With
        Application.ScreenUpdating = True
    End Sub
    
  • Microsoft HTML对象库

  • CSS选择器:

    Option Explicit 
    Public Sub GetInfo()
        Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
        Set html = New HTMLDocument
        Set http = New clsHTTP
        Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
        Application.ScreenUpdating = False
    
        With ThisWorkbook.Worksheets("Sheet1")
            lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
            Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
            Case Else
                arr = .Range("B2:B" & lastRow).Value
            End Select
    
            ReDim groupResults(1 To lastRow - 1)
    
            With http
                For i = LBound(arr, 1) To UBound(arr, 1)
                    If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                        sResponse = .GetHTML(BASE_URL & arr(i, 1))
                        html.body.innerHTML = sResponse
                        groupResults(i) = .GetEntityType(html)
                        sResponse = vbNullString: html.body.innerHTML = vbNullString
                    End If
                Next
            End With
            For i = LBound(groupResults) To UBound(groupResults)
                .Cells(i + 1, "C") = groupResults(i)
            Next
        End With
        Application.ScreenUpdating = True
    End Sub
    
    我使用实体描述是一个超链接(
    a
    tag)并且其值包含字符串
    EntityTypeDescription
    的事实来使用css属性=值和contains(*)操作符作为目标


    请注意,这需要输入到模块中,而不是编码到图纸对象中。此外,如果代码不适用于Sheet1和Sheet2,则您可以将其更改为工作表(“名称”),其中名称是您在工作簿底部看到的各个选项卡的名称。谢谢!我刚和一个朋友聊天,他建议放弃搜索