Word VBA web scraping: how to scrape certain classes and skip others


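I am trying to scrape certain classes from a web page, in order, while leaving out others. However, I cannot work out how to select only the "child" classes I am interested in, transcript-question and transcript-answer, and not timestamp, since all of them appear to sit inside transcription-item-wrapper.

Is there an elegant way to do this, or do I need to work on the extracted string and strip out the unwanted HTML?

Current code: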

Sub ScrapeToWord()
    Const URL = "http://......."
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim topics As Object, posts As Object, topic As Object

    ' Fetch the page synchronously and load the response into an HTML document
    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.responseText

    ' Walk each wrapper and pick out the question elements inside it
    Set topics = html.getElementsByClassName("transcription-item-wrapper")
    For Each posts In topics
        For Each topic In posts.getElementsByClassName("transcript-question")
            ' Note: every match is written to the same cell, so only the last one remains
            ActiveDocument.Tables(1).Cell(1, 1).Range.Text = topic.innerText
        Next topic
    Next posts
End Sub

A snippet of the HTML:

    <div class="transcription-section">
        <div class="transcription-section-wrapper">
        <div class="transcription-item-wrapper"><div class="transcript-qa"><div class="timestamp"></div><div class="transcript-question">Tape 01</div></div></div><div class="transcription-item-wrapper"><div class="transcript-qa"><div class="timestamp">
            <p class="34" id="01003400">01:00:34:00
            </p>
            <span class="listen"></span>
            <span class="watch"></span>
          </div><div class="transcript-question">Could begin with a brief overview of your life.</div></div><div class="transcript-qa"><div class="timestamp"></div><div class="transcript-answer">I was born in 1942. I was born on a farm and started school when I was 4 years old.</div></div></div><div class="transcription-item-wrapper"><div class="transcript-qa"><div class="timestamp">
            <p class="60" id="01010000">01:01:00:00
            </p>
            <span class="listen"></span>
            <span class="watch"></span>
          </div><div class="transcript-question">And then?</div></div><div class="transcript-qa"><div class="timestamp"></div><div class="transcript-answer">During the Depression my father lost the farm then we moved to Sandridge and I went to school there until I was about 8. We then went dairy farming there and</div></div></div><div class="transcription-item-wrapper"><div class="transcript-qa"><div class="timestamp">
          <p class="90" id="01013000">01:01:30:00
          </p>
          <span class="listen"></span>
          <span class="watch"></span>
        </div><div class="transcript-answer">no machine milking in those days, it was all hand milking. </div></div><div class="transcript-qa"><div class="timestamp">

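Tape 01

01:00:34:00

I was born in 1942. I was born on a farm and started school when I was 4 years old.

01:01:00:00

And then? During the Depression my father lost the farm then we moved to Sandridge and I went to school there until I was about 8. We then went dairy farming there and

01:01:30:00

no machine milking in those days, it was all hand milking.

You can use querySelectorAll to gather a nodeList of the items, matching the two classes of interest with CSS "or" syntax (a comma-separated selector list). Then loop over the list and do whatever you need with each matched item. Because the selector names only transcript-question and transcript-answer, the timestamp elements are never returned, and matches come back in document order.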

Dim i As Long, nodeList As Object

Set nodeList = html.querySelectorAll(".transcript-question, .transcript-answer")

For i = 0 To nodeList.Length - 1
    Debug.Print nodeList.item(i).innerText 'do something with each return value e.g. put in a table or print out
Next i
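
To show how the loop above could feed the Word table from the question, here is a minimal sketch. It assumes references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library" are set, that the active document already contains at least one table, and that one row per match in column 1 is the layout you want. The name ScrapeQAToTable is hypothetical, and the URL stays elided as in the question.

```vba
' Sketch only: writes each question/answer into its own row of the first table.
' Assumes the references named above are set and Tables(1) already exists.
Sub ScrapeQAToTable()
    Const URL = "http://......."   ' elided in the question; supply the real page
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim nodeList As Object, tbl As Table
    Dim i As Long, r As Long

    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.responseText

    ' The comma-separated selector matches either class; timestamp is never selected
    Set nodeList = html.querySelectorAll(".transcript-question, .transcript-answer")
    Set tbl = ActiveDocument.Tables(1)

    For i = 0 To nodeList.Length - 1
        r = i + 1
        ' Grow the table when the matches outnumber the existing rows
        If r > tbl.Rows.Count Then tbl.Rows.Add
        tbl.Cell(r, 1).Range.Text = nodeList.item(i).innerText
    Next i
End Sub
```

Index-based access with item(i) is used deliberately: iterating this node list with For Each is commonly reported to be unreliable in VBA.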

I don't see an actual working URL listed here. Maybe you could import everything and filter out what you don't need.
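
The "import everything and filter" idea could look roughly like the sketch below. It relies on the structure in the posted snippet, where each transcript-qa element holds a timestamp child plus a question or answer child. The name PrintAllButTimestamps is hypothetical, and the routine expects an HTMLDocument that has already been loaded the way the question's code loads it.

```vba
' Sketch only: visits every child of each transcript-qa block and skips timestamps.
' Requires a reference to "Microsoft HTML Object Library".
Sub PrintAllButTimestamps(ByVal html As HTMLDocument)
    Dim qa As Object, child As Object

    For Each qa In html.getElementsByClassName("transcript-qa")
        For Each child In qa.Children
            ' Keep anything that is not a timestamp (questions and answers here)
            If child.className <> "timestamp" Then
                Debug.Print child.innerText
            End If
        Next child
    Next qa
End Sub
```

This excludes by class rather than including by class, so any other child class that appears on the real page would also be printed.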