Html: Extracting content between two sets of tags


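I am trying to extract some content and put it into Excel in table format. The first column is the country, and the second column is the measures that country is implementing against the coronavirus. Here is what the HTML looks like: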

<strong>AUSTRALIA</strong> - published 11.02.2020<br />
1. Passengers who have transited through or have been in China (People's Rep.) on or after 1 February 2020, will not be allowed to transit or enter Australia.<br />
- This does not apply to nationals of Australia. They will be required to self-isolate for a period of 14 days from their arrival into Australia.<br />
- This does not apply to permanent residents of Australia and their immediate family members. They will be required to self-isolate for a period of 14 days from their arrival into Australia.<br />
- This does not apply to airline crew.<br />
2. Nationals of Australia who have transited through or have been in China (People's Rep.) on or after 1 February 2020 will be required to self-isolate for a period of 14 days from their arrival into Australia.<br />
3. Permanent residents of Australia and their immediate family members who have transited through or have been in China (People's Rep.) on or after the 1 February 2020 will be required to self-isolate for a period of 14 days from their arrival into Australia.<br />
<br />
<strong>AZERBAIJAN</strong> - published 06.02.2020

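So there is no real structure to speak of. I would like to extract the list of countries into one column, which is easy because the names sit between strong tags. The other column should hold the corresponding text for each country, and that is harder because nothing isolates that text. The only thing I can think of is to have VBA loop between two sets of strong tags and extract that content as the second column, but I don't know how to do that. The code I have found so far lets me extract the list of countries, and not much else: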

Sub Test()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLAs As MSHTML.IHTMLElementCollection
Dim HTMLA As MSHTML.IHTMLElement

IE.Visible = True
IE.navigate "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm"

Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop

Set HTMLDoc = IE.Document

ProcessHTMLPage HTMLDoc

    Set HTMLAs = HTMLDoc.getElementsByTagName("strong")

    For Each HTMLA In HTMLAs

    Debug.Print HTMLA.innerText
    If HTMLA.getAttribute("href") = "http://x-rates.com/table/" And HTMLA.getAttribute("rel") = "ratestable" Then
        HTMLA.Click
'I don't understand why, but the previous line of code is essential to making this work. Otherwise I only get the first country
        Exit For
        End If

Next HTMLA

End Sub

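Please read the comments in the macro. Feel free to arrange the text differently, to pull the message date out of the string, or to do whatever else you want with it: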

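Edit 2: I deleted my first edit, in which I had pointed out errors in the macro. I have now fixed those errors and replaced the macro code with this edit.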

Edit 3: I replaced the second macro with one that now works.

Sub ExtractCoronaVirusCountryInfos()

  'To get the clean text for each country we must restructure the HTML of parts of the page
  'That means deleting some tags (p and span) and placing some new tags (div and p)
  'To manipulate the HTML the way we need, we use DOM (Document Object Model) methods
  'together with string operations on the HTML code

  Dim url As String
  Dim ie As Object
  Dim nodeTextContainer As Object
  Dim nodeAllP As Object
  Dim nodeOneP As Object
  Dim nodeNewBody As Object
  Dim nodeAllDiv As Object
  Dim nodeOneDiv As Object
  Dim htmlString As String
  Dim tableRow As Long
  Dim tableColumn As Long
  Dim countryName As String
  Dim infoDate As String
  Dim infoText As String
  Dim p As Long
  Dim openingArrowBracketIndex As Long
  Dim closingArrowBracketIndex As Long
  Dim openingRealBrTagComment As Long
  Dim closingRealBrTagComment As Long
  Dim openingRealBrTagStyle As Long
  Dim closingRealBrTagStyle As Long

  tableRow = 2
  tableColumn = 1
  url = "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm"

  'Initialize Internet Explorer, set visibility,
  'call URL and wait until page is fully loaded
  Set ie = CreateObject("internetexplorer.application")
  ie.Visible = True
  ie.navigate url
  Do Until ie.readyState = 4: DoEvents: Loop
  'Application.Wait Now + TimeSerial(0, 0, 2)

  'Get the text container
  Set nodeTextContainer = ie.document.getElementsByClassName("middle")(0)
  '
  'Get all p-tags
  'They contain the text we want
  Set nodeAllP = nodeTextContainer.getElementsByTagName("p")
  '
  'Drop the p tags (only the opening and closing strings)
  'and concatenate the results of this operation
  'We can do this very easily by reading the innerHTML
  For Each nodeOneP In nodeAllP
    htmlString = htmlString & nodeOneP.innerhtml
  Next nodeOneP

  'Now we want to drop the span tags. But we can't do that the same way
  'as with the p tags because there are nested span tags in the document
  'Let's see what the problem with nested tags is
  '
  'HTML code example with two nested span tags:
  '<span>
  '  <span>
  '    Data to show
  '  </span>
  '</span>
  '
  'VBA code to build a node collection:
  'Set nodeAllSpan = ie.document.getElementsByTagName("span")
  '
  'Now there are two elements in the node collection:
  'nodeAllSpan(0) = <span><span>Data to show</span></span>
  'nodeAllSpan(1) = <span>Data to show</span>
  '
  'The text we want is doubled!
  'If we take the innerText of the whole collection we get this:
  'Data to showData to show
  '
  'That is really not our goal. That's why we use string operations to delete
  'all span tags. For the closing parts </span> it's easy with Replace. The opening
  'parts are unknown because they can have style information, attributes and even
  'more. So we must first search for '<span', then for the next '>' after that
  'position in the string. Then we can delete the tag and go on to the next one
  '
  'First we replace the closing parts of all span tags with an empty string
  htmlString = Replace(htmlString, "</span>", "")
  '
  'With the following part of the macro we delete the opening parts of all span tags
  'We must search the whole string after each manipulation again so we need a loop
  'until there is no more span tag
  Do
    openingArrowBracketIndex = InStr(1, htmlString, "<span")
    closingArrowBracketIndex = InStr(openingArrowBracketIndex + 1, htmlString, ">")
    If openingArrowBracketIndex > 1 Then
      openingArrowBracketIndex = openingArrowBracketIndex - 1
    End If
    htmlString = Left(htmlString, openingArrowBracketIndex) & Mid(htmlString, closingArrowBracketIndex + 1)
  Loop Until openingArrowBracketIndex = 0

  'Now we have a string that starts and ends with some text we don't need
  'But it also has a pattern we can use to place new HTML tags, which let us
  'get the text the way we want it
  '
  'The start text is dropped automatically. The end text is too, after a little manipulation before placing all other new tags
  htmlString = Replace(htmlString, "<br><br><br>", "</div>")
  'Now we place the new structure
  htmlString = Replace(htmlString, "<br><strong><br>", "<strong>")
  htmlString = Replace(htmlString, "<strong><br><br>", "<strong>")
  htmlString = Replace(htmlString, "<br><br><strong>", "<strong>")
  htmlString = Replace(htmlString, "<br><br><a name=" & Chr(34) & "_GoBack" & Chr(34) & "></a><strong>", "<strong>")
  htmlString = Replace(htmlString, "<br><strong>", "<strong>")
  htmlString = Replace(htmlString, "<strong><br>", "<strong>")
  htmlString = Replace(htmlString, "<strong>", "</p></div><div><strong>")
  htmlString = Replace(htmlString, "</strong>", "</strong><p>")
  htmlString = Replace(htmlString, "<br>", "</p><p>")

  'Our htmlString contains all info we want. So we can
  'use the ie to generate a new dom object
  ie.Quit
  Set ie = CreateObject("internetexplorer.application")
  ie.Visible = True
  ie.navigate "about:blank"
  Do Until ie.readyState = 4: DoEvents: Loop

  'First we encapsulate our htmlString in a body tag to be able to query it afterwards
  htmlString = "<body>" & htmlString & "</body>"
  '
  'Then we use a little trick to get the htmlString as a DOM object
  ie.document.Write (htmlString)
  Set nodeNewBody = ie.document.getElementsByTagName("body")(0)

  'Now we can get the text like we want it
  '
  'The information for every single country is now placed in a div tag
  'By creating a node collection of all div tags we automatically drop
  'the unneeded text at the start and at the end
  Set nodeAllDiv = ie.document.getElementsByTagName("div")
  '
  'Place data for each country in the excel table
  For Each nodeOneDiv In nodeAllDiv
    'Get country name
    countryName = Trim(nodeOneDiv.getElementsByTagName("strong")(0).innertext)
    ActiveSheet.Cells(tableRow, tableColumn).Value = countryName
    tableColumn = tableColumn + 1

    'Get date of message
    'The date string is always placed in the first p tag
    infoDate = Trim(nodeOneDiv.getElementsByTagName("p")(0).innertext)
    ActiveSheet.Cells(tableRow, tableColumn).Value = infoDate
    tableColumn = tableColumn + 1

    'Get the message itself
    'The text of the message is placed from p tag 2 till the last p tag
    Set nodeAllP = nodeOneDiv.getElementsByTagName("p")
    '
    For p = 1 To nodeAllP.Length - 1
      infoText = infoText & Trim(nodeAllP(p).innertext) & Chr(10)
    Next p
    '
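    'Write infoText to the table without the trailing line feed
    '(guarded, because Left with a negative length raises a runtime error)
    If Len(infoText) > 0 Then
      ActiveSheet.Cells(tableRow, tableColumn).Value = Left(infoText, Len(infoText) - 1)
    End If
    infoText = ""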
    tableColumn = 1
    tableRow = tableRow + 1
  Next nodeOneDiv
  ie.Quit
End Sub
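
If the HTML fragment is already available as a string, the idea from the question (take everything between one strong tag and the next) can also be sketched with plain string functions, without rebuilding a DOM. This is only a minimal sketch and not part of the macro above: SplitByStrong, TrimLineFeeds and DemoSplitByStrong are hypothetical names, and it assumes that country names are the only strong elements in the fragment and that the tags are written in lower case without attributes.

```vba
Option Explicit

'Hypothetical helper (an assumption, not from the macro above):
'returns a Collection of two-element arrays: (0) = country, (1) = text
Function SplitByStrong(ByVal html As String) As Collection
  Dim result As New Collection
  Dim blocks() As String
  Dim i As Long
  Dim closePos As Long
  Dim country As String
  Dim body As String

  'Remove the line breaks of the source, then turn br tags into line feeds
  html = Replace(html, vbCr, "")
  html = Replace(html, vbLf, "")
  html = Replace(html, "<br />", vbLf)
  html = Replace(html, "<br/>", vbLf)
  html = Replace(html, "<br>", vbLf)

  'Every block after the first one starts with a country name
  blocks = Split(html, "<strong>")
  For i = 1 To UBound(blocks)
    closePos = InStr(1, blocks(i), "</strong>")
    If closePos > 0 Then
      country = Trim$(Left$(blocks(i), closePos - 1))
      body = Mid$(blocks(i), closePos + Len("</strong>"))
      result.Add Array(country, TrimLineFeeds(body))
    End If
  Next i

  Set SplitByStrong = result
End Function

'Strip spaces and line feeds from both ends of a string
Private Function TrimLineFeeds(ByVal s As String) As String
  s = Trim$(s)
  Do While Left$(s, 1) = vbLf
    s = Trim$(Mid$(s, 2))
  Loop
  Do While Right$(s, 1) = vbLf
    s = Trim$(Left$(s, Len(s) - 1))
  Loop
  TrimLineFeeds = s
End Function

'Usage example with a shortened, made-up fragment in the style of the page
Sub DemoSplitByStrong()
  Dim pairs As Collection
  Dim pair As Variant
  Dim r As Long

  Set pairs = SplitByStrong("<strong>AUSTRALIA</strong> - published 11.02.2020<br />" & _
                            "1. First rule<br /><br />" & _
                            "<strong>AZERBAIJAN</strong> - published 06.02.2020")
  r = 2
  For Each pair In pairs
    ActiveSheet.Cells(r, 1).Value = pair(0)
    ActiveSheet.Cells(r, 2).Value = pair(1)
    r = r + 1
  Next pair
End Sub
```

The text column still begins with the "- published ..." part; it can be separated at the first line feed if the date should go into its own column. Nested span tags or other markup inside the text would remain in the output, which is the case the macro above handles by stripping the span tags first.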