提取电子邮件HTML元素_Html_Excel_Vba_Web Scraping_Css Selectors

提取电子邮件HTML元素

html excel vba web-scraping

提取电子邮件HTML元素,html,excel,vba,web-scraping,css-selectors,Html,Excel,Vba,Web Scraping,Css Selectors,我正在努力刮一页，但有一点我被卡住了。这里是整个HTML页面的第一个HTML部分路线 Mehr详细信息非常感谢。我可以用这些线算出 If InStr(post.Item(i).getElementsByTagName("a")(1).href, "mailto:") Then Debug.Print Split(Split(post.Item(i).getElementsByTagName("a")(1).href, "mailto:")(1), "?")(0) End

我正在努力刮一页，但有一点我被卡住了。这里是整个HTML页面的第一个HTML部分


路线
Mehr详细信息

非常感谢。我可以用这些线算出

    If InStr(post.Item(i).getElementsByTagName("a")(1).href, "mailto:") Then
    Debug.Print Split(Split(post.Item(i).getElementsByTagName("a")(1).href, "mailto:")(1), "?")(0)
End If

但我欢迎任何其他改进和学习的建议。 *测试后，如果在元素中找不到电子邮件，我会遇到一个错误。如何避免错误？我可以在下一步恢复错误时使用

。但我希望处理错误，而不是跳过它
**编辑：
我可以用这个结构来解决第二点
    Dim emailObj As Object

Set emailObj = post.Item(i).getElementsByTagName("a")(1)
If Not emailObj Is Nothing Then
 If InStr(post.Item(i).getElementsByTagName("a")(1).href, "mailto:") Then
    Debug.Print Split(Split(post.Item(i).getElementsByTagName("a")(1).href, "mailto:")(1), "?")(0)
End If

代码可以正常工作，但有时无法正确抓取电子邮件。。那是因为这条线
有时对象未指定给1。所以我的最后一个问题是：无论分配的号码是多少，我如何获得电子邮件数据
在圈内，我尝试了这条线，但没有用
Set aNodeList = post.Item(i).querySelectorAll(".contains-icon-email")(0)

原始问题：
在这种情况下，我将使用attribute=value选择器和contains操作符，通过字符串mailto
将href
属性作为目标。添加css选择器：[href*=mailto]

如果使用queryselectoral（“[href*=mailto]”
可以测试.Length
属性是否大于0，或者使用querySelector
并测试如果不是querySelector（“[href*=mailto]”）则什么都不是
如果设置为变量
Dim ele As Object

Set ele = html.document.querySelector("[href*=mailto]")
If Not ele Is Nothing Then
    Debug.Print ele.href  'do something with the href to parse out email
End If


更新问题：
对于更新后的问题，我会将当前节点在nodeList中的outerHTML
转换为代理HTMLDocument
变量，这样我就可以再次利用querySelector
方法。我会以班级为目标发送电子邮件
Option Explicit

Public Sub GetListingInfo()

    Const URL As String = "https://www.gelbeseiten.de/Suche/Ambulante%20Pflegedienste/Bundesweit"
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument

    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument

    With http
        .Open "Get", URL, False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim post As Object, html2 As MSHTML.HTMLDocument

    Set post = html.querySelectorAll(".mod-Treffer")
    Set html2 = New MSHTML.HTMLDocument

    Dim i As Long, emailNode As Object

    With ActiveSheet

        .Range("A1").Resize(1, 3).Value = Array("Title", "Phone", "Email")

        For i = 0 To post.Length - 1

            html2.body.innerHTML = post.Item(i).outerHTML

            .Cells(i + 2, 1).Value = html2.querySelector("h2").innerText
            .Cells(i + 2, 2).Value = html2.querySelector(".mod-AdresseKompakt__phoneNumber").innerText

            Set emailNode = html2.querySelector(".contains-icon-email")

            If Not emailNode Is Nothing Then .Cells(i + 2, 3).Value = Replace$(emailNode.href, "mailto:", vbNullString)
        Next i
    End With
End Sub

谢谢。事实上，我会在我提到的post对象的循环中使用它，所以我将您的行更改为Set ele=post.Item（I）.querySelector（“[href*=mailto]”），但这会引发一个错误。什么毛病？。到目前为止，我已经更新并放置了代码。除了VBA中的文档，您不能使用querySelector。您希望检索多少封电子邮件？您可以使用querySelectorAll收集所有mailto href节点，并循环提取hrefI在您现在添加到问题的url上看不到任何电子邮件。搜索也不会返回任何结果。是否有一个url可以让我看到标题、电话、电子邮件？使用两个HTMLDocument时，您不会注意到任何速度问题。不过，通过使用querySelector，您可以访问更快的css选择器。
Option Explicit

Public Sub GetListingInfo()

    Const URL As String = "https://www.gelbeseiten.de/Suche/Ambulante%20Pflegedienste/Bundesweit"
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument

    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument

    With http
        .Open "Get", URL, False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim post As Object, html2 As MSHTML.HTMLDocument

    Set post = html.querySelectorAll(".mod-Treffer")
    Set html2 = New MSHTML.HTMLDocument

    Dim i As Long, emailNode As Object

    With ActiveSheet

        .Range("A1").Resize(1, 3).Value = Array("Title", "Phone", "Email")

        For i = 0 To post.Length - 1

            html2.body.innerHTML = post.Item(i).outerHTML

            .Cells(i + 2, 1).Value = html2.querySelector("h2").innerText
            .Cells(i + 2, 2).Value = html2.querySelector(".mod-AdresseKompakt__phoneNumber").innerText

            Set emailNode = html2.querySelector(".contains-icon-email")

            If Not emailNode Is Nothing Then .Cells(i + 2, 3).Value = Replace$(emailNode.href, "mailto:", vbNullString)
        Next i
    End With
End Sub