Web scraping in VBA with an If condition

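I am trying to import bullet points from a website into an Excel sheet (each bullet point is wrapped in an li tag).

However, I am facing a significant difficulty: some of the pages I want to scrape have several "parts" (Part 1, Part 2, like this one), while others do not (like this one).

I have come up with a draft of the code that I believe is a workable start. However, I still have some problems, and I get an error message ("timeout"). Do you know how I could fix it?

Thanks in advance for your help.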

Sub Page()
GetPage ("https://www.thewindpower.net/windfarm_en_1922_a-capelada-i.php")
End Sub



Sub GetPage(URL As String)

Dim count As Integer

Dim Request As MSXML2.ServerXMLHTTP60: Set Request = New MSXML2.ServerXMLHTTP60

Dim Result As HTMLDocument: Set Result = New HTMLDocument

Request.Open "GET", URL, False
Request.send

Result.body.innerHTML = Request.responseText

Dim oRows As MSHTML.IHTMLElementCollection
Dim oRow As MSHTML.IHTMLElement

Dim oCells As MSHTML.IHTMLElementCollection
Dim oCell As MSHTML.IHTMLElement

Dim oLinks As MSHTML.IHTMLElementCollection

'Set Generalities
Set oRows = Result.getElementsByTagName("ul")(4).getElementsByTagName("li")

Dim iRow As Integer 'output li counter
Dim iColumn As Integer 'output column counter
Dim Sheet As Worksheet 'output sheet

iRow = 1
iColumn = 1

Set Sheet = ThisWorkbook.Worksheets("Sheet1")


count = Result.getElementsByTagName("h3").Length


If count > 0 Then
    '# f Part on the page, 2 for the moment
    Dim p As Integer
    Dim o As Integer
    p = count / 2
    
    'Counter for each Part identified
    For o = 1 To p
                'Set Generalities data

                iRow = 1
                iColumn = 1
                            
                For Each oRow In oRows
                    Set oCells = oRow.getElementsByTagName("li")
                        For Each oCell In oCells
                                Sheet.Cells(iRow, iColumn).Value = oCell.innerText
                                iColumn = iColumn + 1
                        Next oCell
                        iRow = iRow + 1
                Next oRow
                        
                'Set Detail data
                Set oRows2 = Result.getElementsByTagName("h3")(o).getElementsByTagName("li")
                
                For Each oRow In oRows2
                    Set oCells = oRow.getElementsByTagName("li")
                        For Each oCell In oCells
                                Sheet.Cells(iRow, iColumn).Value = oCell.innerText
                                iColumn = iColumn + 1
                        Next oCell
                        iRow = iRow + 1
                        iColumn = 1
                Next oRow
                                   
        iRow = iRow + 1
        'insert a row
        Range("iRow").Insert CopyOrigin:=xlFormatFromRightOrBelow
        
        'increment Part counter
    Next o
    
    Else
    
        'Set Generalities data
            For Each oRow In oRows
                    Set oCells = oRow.getElementsByTagName("li")
                        For Each oCell In oCells
                                Sheet.Cells(iRow, iColumn).Value = oCell.innerText
                                iColumn = iColumn + 1
                        Next oCell
                        iRow = iRow + 1
                Next oRow
                        
                        
            'Set Detail data
            Set oRows2 = Result.getElementsByTagName("ul")(5).getElementsByTagName("li")
                
                For Each oRow In oRows2
                    Set oCells = oRow.getElementsByTagName("li")
                        For Each oCell In oCells
                                Sheet.Cells(iRow, iColumn).Value = oCell.innerText
                                iColumn = iColumn + 1
                        Next oCell
                        iRow = iRow + 1
                        iColumn = 1
                Next oRow
    End If
End Sub

Summary

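I would gather nodeLists via css selectors to match the relevant nodes. I would have two separate nodeLists: one for the generalities and one for the parts. I would determine the number of parts (as they repeat) and loop up to that number, concatenating the html of each part with the html of its later repeat. I would then put the combined html into a surrogate HTMLDocument variable and create a new nodeList of all the li elements it contains. A helper function returns the text of the nodeList's nodes in an array, which is then written to the worksheet, one row per combined part.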


VBA:

Option Explicit

Public Sub WindInfo()
    'VBE > Tools > References:
    '1. Microsoft XML, v6.0
    '2. Microsoft HTML Object Library
    '3. Microsoft Scripting Runtime
    Dim xhr As MSXML2.XMLHTTP60: Set xhr = New MSXML2.XMLHTTP60
    Dim html As MSHTML.HTMLDocument: Set html = New MSHTML.HTMLDocument
    Dim ws As Worksheet: Set ws = ThisWorkbook.Worksheets("Sheet1")
    
    With xhr
        .Open "GET", "https://www.thewindpower.net/windfarm_en_7410_khizi.php", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim generalities As Object, arrGen(), partsList As Object
    
    Dim r As Long

    Set generalities = html.querySelectorAll("#bloc_texte table ~ table li")
    arrGen = GetNodesTextAsArray(generalities)
    
    Dim parts As Object, numberOfParts As Long
    
    Set partsList = html.querySelectorAll("h1 ~ h3, ul ~ h3")
    
    r = 1
    
    If partsList.Length > 0 Then
    
        numberOfParts = html.querySelectorAll("h1 ~ h3, ul ~ h3").Length / 2
    
        Set parts = html.querySelectorAll("h3 + ul")
       
        Dim i As Long, liNodes As Object, arr()
        Dim html2 As MSHTML.HTMLDocument: Set html2 = New MSHTML.HTMLDocument
        
        For i = 0 To numberOfParts - 1
            ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
            html2.body.innerHTML = parts.Item(i).outerHTML & parts.Item(i + numberOfParts).outerHTML
            Set liNodes = html2.querySelectorAll("li")
            arr = GetNodesTextAsArray(liNodes)
            ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
            r = r + 1
        Next
    Else
        Dim alternateNodeList As Object: Set alternateNodeList = html.querySelectorAll("#bloc_texte h1 + ul")
        
        If alternateNodeList.Length >= 2 Then 'Item(1) is the second match, so at least two matches are needed
            arr = GetNodesTextAsArray(alternateNodeList.Item(1).getElementsByTagName("li"))
        Else
            arr = Array("No", "Data", vbNullString)
        End If
        ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
        ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
    End If
End Sub

Public Function GetNodesTextAsArray(ByVal nodeList As Object) As Variant()
    Dim i As Long, results()
    
    If nodeList.Length = 0 Then
        GetNodesTextAsArray = Array("No", "Data", vbNullString)
        Exit Function
    End If
    
    ReDim results(1 To nodeList.Length)

    For i = 0 To nodeList.Length - 1
        results(i + 1) = nodeList.Item(i).innerText
    Next i
    GetNodesTextAsArray = results
End Function
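
The question mentions a "timeout" error with MSXML2.ServerXMLHTTP60, while the code above uses MSXML2.XMLHTTP60 instead. If the ServerXMLHTTP60 class from the question is kept, one thing that might be worth trying is raising its timeouts with setTimeouts before opening the request. The following is only a sketch: FetchHtml is a hypothetical helper name, the timeout values are illustrative assumptions, and it is not confirmed in this thread that this resolves the original error.

```vba
Option Explicit

' Sketch only: FetchHtml is a hypothetical helper, not part of the answer above.
' Requires a reference to "Microsoft XML, v6.0".
Public Function FetchHtml(ByVal url As String) As String
    Dim request As MSXML2.ServerXMLHTTP60
    Set request = New MSXML2.ServerXMLHTTP60

    ' setTimeouts takes four values in milliseconds, in this order:
    ' resolve, connect, send, receive. It is called before Open.
    ' The values below are assumptions chosen for illustration.
    request.setTimeouts 5000, 10000, 30000, 60000

    request.Open "GET", url, False
    request.send

    ' Return the body only for a successful response; otherwise an empty string.
    If request.Status = 200 Then
        FetchHtml = request.responseText
    Else
        FetchHtml = vbNullString
    End If
End Function
```

The returned string could then be assigned to html.body.innerHTML in the same way as .responseText is in WindInfo.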

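  • It would help to indicate the desired output format (beyond each row containing only one part). For example, should a row include the production forecast? Do you need the Details and Localisation headings? Should the generalities be included even when there are no parts? Should part 1 from Details and from Localisation be joined into a single row, and likewise for parts 2 and 3?
  • Thanks for your feedback, QHarr! As a matter of fact, for each page I would like to get one row in Excel that includes (i) the bullet points from the Generalities section and (ii) the bullet points from the Details section, with each bullet point in a separate column, if there is only one part. If there are several parts, I would like one row per part with the same bullet points as above (the same for every part). The exception is the production forecast details, for which I already have a table based on the answer to this question.
  • Do the generalities get their own row, since they do not seem to be tied to a specific details part?
  • In fact, the information in Generalities applies to all parts, so I would like it in the row for part 1 and also in the rows for part 2, part 3, ...
  • @QHarr I have updated the code above based on my research and attempts. It does not work yet, but it at least gives an idea of a structure that could be used. Could you tell me what you think of it?
  • Thanks for your answer, QHarr! It looks like it works very well! However, for pages without parts (because there is only one), like this one, it only detects the generalities. Do you think I could fix that with an If statement based on the value of numberOfParts?
  • I would store html.querySelectorAll("h1 ~ h3, ul ~ h3").Length in a variable; if the length is greater than 0, you can divide that variable by 2 to get the number of parts, and so on. Otherwise, you will need to handle the zero case.
  • I just tried it and apparently it did not change anything; the Details section is not detected. Should I change the For loop at the end of the first Sub?
  • Sorry... you want the parts and the generalities, don't you? Details were not mentioned in the question. You can test whether the parts are present in the html as a string; if they are not, assume there are generalities and a description, which can be matched with #bloc_texte h1 + ul as the css selector. You then only need the first two nodes, not the third, i.e. index 0 and index 1.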
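
The suggestion in the comments, to store the heading count in a variable and halve it only when headings exist, could be sketched as a small helper. CountParts is a hypothetical name introduced here for illustration; the halving assumes, as the answer's code does, that each part's h3 heading appears twice on the page.

```vba
Option Explicit

' Sketch only: CountParts is a hypothetical helper, not part of the original answer.
' headingCount is the Length of the h3 nodeList, e.g. partsList.Length.
Public Function CountParts(ByVal headingCount As Long) As Long
    If headingCount <= 0 Then
        ' No part headings found: the caller should take the single-part branch.
        CountParts = 0
    Else
        ' Integer division: assumes every part heading occurs twice on the page.
        CountParts = headingCount \ 2
    End If
End Function
```

In WindInfo this could be used as numberOfParts = CountParts(partsList.Length), with a result of 0 sending execution to the Else branch that handles pages without parts.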