HTML: not all data is imported from the website into Excel

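I want to import restaurant data such as restaurant name, phone number and website into Excel. Unfortunately I only get one page (the first page), whereas I want the data for whatever page range I define (for example pages 1 to 3, or pages 2 to 5), with each page written to its own sheet. A sample output file showing the output I currently get is attached.


This is the work done so far; I am struggling to get the desired output from it.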

The only way to do this with VBA is to check whether a "Next" button exists and, if it does, click it.

This is its HTML:

<a class="next ajax-page" href="/atlanta-ga/restaurants?page=2" data-page="2" data-analytics="{&quot;click_id&quot;:132}" data-remote="true" data-impressed="1">Next</a>


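Doing this in VBA is not science fiction, but there are commercial RPA solutions that offer this kind of task out of the box: UiPath, Automation Anywhere, Blue Prism. Python's Beautiful Soup would also do the job well.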
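A minimal sketch of that check, assuming `doc` is the document of an already loaded InternetExplorer instance. The name `ClickNextIfPresent` is hypothetical, and the `a.next` selector is simply derived from the class on the link shown above.

```vba
'Hypothetical helper: clicks the "Next" link if the page has one.
'doc is assumed to be ie.document from a loaded InternetExplorer.Application.
Public Function ClickNextIfPresent(ByVal doc As Object) As Boolean
    Dim nextLink As Object

    'querySelector returns Nothing when no element matches
    Set nextLink = doc.querySelector("a.next")

    If Not nextLink Is Nothing Then
        nextLink.Click
        ClickNextIfPresent = True   'caller should wait for the next page to load
    End If
End Function
```

The caller would scrape the current page, call this function, wait for the browser to finish loading, and repeat until it returns False.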


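The page number is concatenated onto the end of the URL. I would use XHR to issue the requests in a loop over the given page range, and a regex to pull out the JSON that contains the required information (it sits in one of the script tags). This method is very fast, which more than offsets the cost of using a regex. I also re-use objects where possible.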

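I use JsonConverter.bas to handle the JSON and parse out the required information (the JSON holds a lot of information, including reviews). After downloading the .bas file and adding it to your project as a module named JsonConverter, go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.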
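To make the shape of the parsed result concrete, here is a small sketch. It assumes the JsonConverter module is present as described; the JSON string is a made-up sample in the same shape as the array the regex targets, not real site data, and `ParseSample` is a hypothetical name.

```vba
'Illustrative only: the JSON below is invented sample data.
Public Sub ParseSample()
    Dim json As Object, item As Object
    Dim sample As String

    sample = "[{""@context"":""http://schema.org"",""name"":""Example Diner""," & _
             """telephone"":""(555) 010-0000"",""url"":""https://example.com""}]"

    'A JSON array is returned as a Collection, each JSON object as a Dictionary
    Set json = JsonConverter.ParseJson(sample)

    For Each item In json
        Debug.Print item("name"), item("telephone"), item("url")
    Next
End Sub
```

Because each element is a Dictionary, further fields can be read by key in the same way if they are present in the JSON.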

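Helper functions test whether the sheet to write to already exists or needs to be created, and write the JSON results into an array that is then dumped to the sheet in one go (an efficiency gain). The structure is kept so that the retrieved information can easily be extended if more is needed (for example, reviews).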

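There may be some work to do to make sure this behaves correctly for pages that do not exist. At the moment I simply use the status code of the response to filter those out.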


Notes:

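As a sanity check, I would use Internet Explorer to go to page 1 and extract the total result count. Dividing that by the results per page (currently 30) gives the total number of pages, which in turn gives the lower and upper bounds (the minimum and maximum possible page). Then switch to XMLHTTP for the actual retrieval. See the additional helper function at the end.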


Code:

Option Explicit

'Sentinel shared by GetValue and its caller so the two can never drift apart
Private Const NOT_FOUND As String = "Not found"

Public Sub GetRestuarantInfo()
    Dim s As String, re As Object, p As String, page As Long, r As String, json As Object
    Const START_PAGE As Long = 2
    Const END_PAGE As Long = 4
    Const RESULTS_PER_PAGE As Long = 30

    'Matches the JSON array embedded in one of the page's script tags
    p = "\[{""@context"".*?\]"
    Set re = CreateObject("VBScript.RegExp")

    Application.ScreenUpdating = False

    With CreateObject("MSXML2.XMLHTTP")

        For page = START_PAGE To END_PAGE
            .Open "GET", "https://www.yellowpages.com/atlanta-ga/restaurants?page=" & page, False
            .send
            If .Status = 200 Then   'skip pages that do not exist
                s = .responseText
                r = GetValue(re, s, p)
                If r <> NOT_FOUND Then
                    Set json = JsonConverter.ParseJson(r)
                    WriteOutResults page, RESULTS_PER_PAGE, json
                End If
            End If
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Sub WriteOutResults(ByVal page As Long, ByVal RESULTS_PER_PAGE As Long, ByVal json As Object)
    Dim sheetName As String, results(), r As Long, headers(), ws As Worksheet, review As Object
    ReDim results(1 To RESULTS_PER_PAGE, 1 To 3)

    sheetName = "page" & page
    headers = Array("Name", "Website", "Tel")
    If Not WorksheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    Else
        'ws must be set in this branch too, otherwise "With ws" fails
        Set ws = ThisWorkbook.Worksheets(sheetName)
        ws.Cells.ClearContents
    End If

    For Each review In json  'collection of dictionaries
        If r = RESULTS_PER_PAGE Then Exit For   'do not write past the end of the array
        r = r + 1
        results(r, 1) = review("name")
        results(r, 2) = review("url")
        results(r, 3) = review("telephone")
    Next

    'Write headers and all rows to the sheet in one go
    With ws
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String
'https://regex101.com/r/M9oRON/1
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
        If .Test(inputString) Then
            GetValue = .Execute(inputString)(0)
        Else
            GetValue = NOT_FOUND
        End If
    End With
End Function

Public Function WorksheetExists(ByVal sName As String) As Boolean  '@Rory https://stackoverflow.com/a/28473714/6241235
    WorksheetExists = Evaluate("ISREF('" & sName & "'!A1)")
End Function

Helper function to return the number of pages:

'Late bound via CreateObject, so no reference is required.
'For early binding: VBE > Tools > References: Microsoft Internet Controls
Public Function GetNumberOfPages(ByVal RESULTS_PER_PAGE As Long) As Variant
    Dim ie As Object, totalResults As Long
    On Error GoTo errhand
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = False
        .Navigate2 "https://www.yellowpages.com/atlanta-ga/restaurants?page=1"

        While .Busy Or .readyState < 4: DoEvents: Wend

        'Strip the surrounding words from the results text and keep the number
        totalResults = CLng(Trim$(Replace$(Replace$(.document.querySelector(".pagination p").innerText, "We found", vbNullString), "results", vbNullString)))
        'Round up so that a partially filled last page is counted
        GetNumberOfPages = (totalResults + RESULTS_PER_PAGE - 1) \ RESULTS_PER_PAGE
        .Quit
    End With
    Exit Function
errhand:
    'Close the hidden browser so it is not left running, then signal failure
    If Not ie Is Nothing Then ie.Quit
    GetNumberOfPages = CVErr(xlErrNA)
End Function
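
One way the page count could be used for the sanity check described in the notes is sketched below. `ClampPageRange` and `DemoClamp` are hypothetical names that are not part of the original answer; the sketch only assumes that `GetNumberOfPages` returns either a number or an error value.

```vba
'Hypothetical helper: limits a requested page range to pages that exist.
'Returns False if the page count is unavailable or the range is empty.
Public Function ClampPageRange(ByVal firstPage As Long, ByVal lastPage As Long, _
                               ByVal resultsPerPage As Long, _
                               ByRef clampedFirst As Long, ByRef clampedLast As Long) As Boolean
    Dim totalPages As Variant, maxPage As Long

    totalPages = GetNumberOfPages(resultsPerPage)
    If IsError(totalPages) Then Exit Function

    'Round up in case the helper returns a fractional page count
    maxPage = CLng(-Int(-CDbl(totalPages)))

    clampedFirst = firstPage
    If clampedFirst < 1 Then clampedFirst = 1
    clampedLast = lastPage
    If clampedLast > maxPage Then clampedLast = maxPage

    ClampPageRange = (clampedFirst <= clampedLast)
End Function

Public Sub DemoClamp()
    Dim firstPage As Long, lastPage As Long
    If ClampPageRange(2, 4, 30, firstPage, lastPage) Then
        Debug.Print "Requesting pages " & firstPage & " to " & lastPage
    Else
        Debug.Print "No valid pages in the requested range"
    End If
End Sub
```

The clamped values would then replace the fixed START_PAGE and END_PAGE bounds in the XMLHTTP loop.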