Scraping data from a website using VBA


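I am trying to scrape data from a website using VBA, such as real-time prices, e.g. the German 5-year Bobl and the US 30-year T-Bond. I have tried an Excel web query, but it only pulls in the whole page, and I only want the rates. Is there a way to do this?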

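There are several ways of doing this. I am writing this answer hoping that anyone searching for the keywords "scraping data from a website" will find all the basics of Internet Explorer automation here. But remember that nothing is worth as much as your own research (unless you want to be stuck with pre-written code that you cannot customize).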

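Please note that this is only one way of doing it. It is not the one I prefer in terms of performance (since it depends on the browser's speed), but it is good for understanding the rationale behind Internet automation.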

1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
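2) I ask the browser to browse the target webpage. Through the property ".Visible", I decide whether I want to see the browser doing its job. While building the code it is nice to have Visible = True, but once the code is working and just scraping data it is nice not to see the browser every time, so Visible = False.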

With appIE
    .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
    .Visible = True
End With
3) The webpage will need some time to load. So I wait while the browser is busy:

Do While appIE.Busy
    DoEvents
Loop
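As a hedged side note, not part of the original answer: checking only .Busy can occasionally return before the document is complete, and the loop never ends if the page hangs. A minimal sketch of a more defensive wait follows. The function name WaitForPage and the 30-second default are my own assumptions; readyState = 4 is the "complete" state of the InternetExplorer object.

```vba
' Hypothetical helper: waits until IE reports the page as loaded,
' or gives up after timeoutSeconds. Returns True if the page loaded.
Public Function WaitForPage(ByVal ie As Object, _
                            Optional ByVal timeoutSeconds As Double = 30) As Boolean
    Const READYSTATE_COMPLETE As Long = 4
    Dim startTime As Double
    Dim elapsed As Double

    startTime = Timer   ' seconds since midnight

    Do While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE
        DoEvents        ' keep Excel responsive while waiting

        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer wrapped at midnight
        If elapsed > timeoutSeconds Then Exit Function  ' timed out: returns False
    Loop

    WaitForPage = True
End Function
```

It could be called as `If Not WaitForPage(appIE) Then MsgBox "Page did not load in time"` in place of the plain loop.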
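4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond. I just press F12 in Internet Explorer to see the webpage's code, and then, using the element selection pointer, I click on the element that I want to scrape to see how I can reach it.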

5) What I should do is straightforward. First of all, I get, by its ID property, the tr element that contains the value:

Set allRowOfData = appIE.document.getElementById("pair_8907")
Here I get a collection of td elements (specifically, tr is a row of data, and the td elements are its cells). We are looking for the 8th one, so I write:

Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
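Why did I write 7 instead of 8? Because the collection of cells starts from 0, so the index of the 8th element is 7 (8-1). A brief analysis of this line of code:

  • .Cells() gives me access to the td elements
  • innerHTML is the property of the cell that contains the value we are looking for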
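If it is not obvious which index holds the value, one option is to print every cell of the row together with its zero-based index. This is only a sketch; the helper name ListCells is hypothetical, and it assumes the argument is a tr element such as the one stored in allRowOfData above.

```vba
' Hypothetical helper: prints each cell of a table row to the Immediate
' window (Ctrl+G in the VBA editor), with its zero-based index.
Public Sub ListCells(ByVal tableRow As Object)
    Dim i As Long

    ' The cells collection is zero-based, so it runs from 0 to Length - 1
    For i = 0 To tableRow.Cells.Length - 1
        Debug.Print i, tableRow.Cells(i).innerText
    Next i
End Sub
```

Calling `ListCells allRowOfData` before closing the browser lists the cells, so the right index can be read off rather than counted by hand.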
Once we have our value, which is now stored in the myValue variable, we can close the IE browser and release the memory by setting the object to Nothing:

appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label on a form (Me.Label1.Caption = myValue).
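Put together, the steps above might look like the sketch below. It reuses the URL, the "pair_8907" ID and the cell index 7 from this answer, all of which depend on the page's markup at the time and may no longer match. The Sub name, the check for a missing row, and writing to the first worksheet are my own assumptions.

```vba
Public Sub ScrapeOneValue()
    Dim appIE As Object
    Dim allRowOfData As Object
    Dim myValue As String

    ' 1) and 2): create the browser and navigate, without showing the window
    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
        .Visible = False
    End With

    ' 3): wait until the page has loaded (readyState 4 = complete)
    Do While appIE.Busy Or appIE.readyState <> 4
        DoEvents
    Loop

    ' 5): get the row by its ID, then read the 8th cell (index 7)
    Set allRowOfData = appIE.document.getElementById("pair_8907")
    If allRowOfData Is Nothing Then
        MsgBox "Row 'pair_8907' was not found; the page layout may have changed."
    Else
        myValue = allRowOfData.Cells(7).innerHTML
        ThisWorkbook.Worksheets(1).Range("A1").Value = myValue
    End If

    ' Close the browser and release the object
    appIE.Quit
    Set appIE = Nothing
End Sub
```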

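I would just like to point out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should do your own research first. The reason I am answering a question that does not show much research effort is that I have seen it asked several times and, back when I learned how to do this, I remember wishing I had had better support to get started. So I hope that this answer, which is just a "study input" and by no means the best or most complete solution, can help the next user who has the same problem. I learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.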


Enjoy your practice ;)

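You can use the WinHttpRequest object instead of Internet Explorer. It is better to load only the data, without pictures and advertisements, than to download the complete webpage including them. Those pictures and advertisements make the Internet Explorer object heavy compared with the WinHttpRequest object.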
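A minimal sketch of that idea, under some assumptions: the Sub name is hypothetical, the URL and the "pair_8907" ID are borrowed from the Internet Explorer answer above, and the row is assumed to be present in the raw HTML the server returns (content added later by JavaScript will not be there).

```vba
Public Sub GetValueWithWinHttp()
    Dim http As Object
    Dim html As Object
    Dim dataRow As Object

    ' Fetch the raw HTML without starting a browser
    Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
    http.Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
    http.Send

    If http.Status <> 200 Then
        MsgBox "Request failed with HTTP status " & http.Status
        Exit Sub
    End If

    ' Load the response into a late-bound HTML document so it can be queried
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = http.responseText

    ' Assumption: the row with this ID exists in the static HTML
    Set dataRow = html.getElementById("pair_8907")
    If dataRow Is Nothing Then
        MsgBox "Row not found in the response; it may be added by JavaScript."
    Else
        MsgBox dataRow.Cells(7).innerText
    End If
End Sub
```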

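This question was asked a long time ago, but I think the following information will be useful for newcomers. You can actually get the value quite easily from the class name, like this: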

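Sub ExtractLastValue()

    Dim objIE As Object
    Set objIE = CreateObject("InternetExplorer.Application")

    ' Position and size the browser window
    objIE.Top = 0
    objIE.Left = 0
    objIE.Width = 800
    objIE.Height = 600

    objIE.Visible = True

    objIE.Navigate "https://uk.investing.com/rates-bonds/financial-futures/"

    ' Wait until the document has finished loading (readyState 4 = complete)
    Do
        DoEvents
    Loop Until objIE.readyState = 4

    ' Show the text of the first element with the class "pid-8907-last"
    MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText

End Sub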
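If you are not familiar with web scraping, please read this blog post.

There are also various other techniques for extracting data from web pages. This article explains a few of them with examples.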


I modified a few things that were throwing errors for me and ended up with this, which works very well for extracting the data I need:

Sub get_data_web()

Dim appIE As Object
Dim allRowofData As Object, itm As Object
Dim Count As Long
Set appIE = CreateObject("internetexplorer.application")

With appIE
    .navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
    .Visible = True
End With

Do While appIE.Busy
    DoEvents
Loop

Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")

Dim i As Long
Dim myValue As String

Count = 1

    For Each itm In allRowofData

        For i = 0 To 4

        myValue = itm.Cells(i).innerText
        ActiveSheet.Cells(Count, i + 1).Value = myValue

        Next

        Count = Count + 1

    Next

appIE.Quit
Set appIE = Nothing


End Sub

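Other methods have been mentioned, so let us acknowledge that, at the time of writing, we are in the 21st century. Let's park the local-bus approach of opening a browser and fly with an XMLHTTP GET request (XHR GET for short).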

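XHR is an API in the form of an object whose methods transfer data between a web browser and a web server. The object is provided by the browser's JavaScript environment.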

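It is a fast method for retrieving data that does not require opening a browser. The server response can be read into an HTMLDocument, and the process of grabbing the table continues from there.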

Note that content rendered or dynamically added by JavaScript will not be retrieved, because no JavaScript engine is running (as there is in a browser).

In the code below, the table is grabbed by its id, cr1.

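In the helper sub, WriteTable, we first loop over the column headers (th tags), then over the table rows (tr tags), and finally walk along each row, table cell by table cell (td tags). Because we only want the data from columns 1 and 8 (which the code picks out as the 2nd and 8th cells of each row), a Select Case statement is used to specify what is written out to the sheet.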


VBA:

Option Explicit
Public Sub GetRates()
    Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
    
    Set html = New HTMLDocument
      
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
        .send
        html.body.innerHTML = .responseText
    End With
    
    Application.ScreenUpdating = False
    
    Set hTable = html.getElementById("cr1")
    WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
    
    Application.ScreenUpdating = True
End Sub

Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
    r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
    With ws
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            Select Case columnCounter
            Case 2
                .Cells(startRow, 1) = header.innerText
            Case 8
                .Cells(startRow, 2) = header.innerText
            End Select
        Next header
        startRow = startRow + 1
        Set tBody = hTable.getElementsByTagName("tbody")
        For Each tSection In tBody
            Set tRow = tSection.getElementsByTagName("tr")
            For Each tr In tRow
                r = r + 1
                Set tCell = tr.getElementsByTagName("td")
                C = 1
                For Each td In tCell
                    Select Case C
                    Case 2
                        .Cells(r, 1).Value = td.innerText
                    Case 8
                        .Cells(r, 2).Value = td.innerText
                    End Select
                    C = C + 1
                Next td
            Next tr
        Next tSection
    End With
End Sub

