Extracting text from specific tags in HTML using Mathematica


For a page whose HTML has a structure like this:

          <tr class="">
            <td class="number">1</td>
            <td class="name"><a href="..." >Jack Green</a></td>
            <td class="score-cell ">
              <span class="display">98
                <span class="tooltip column1"></span>
              </span>
            </td>
            <td class="score-cell ">
              ...
            </td>
          ...
          <tr class="">
            <td class="number">2</td>
            <td class="name"><a href="..." target="_top">Nicole Smith</a></td>
            <td class="score-cell ">
             ...
            </td>

How can I extract only the text from the name tags, ending up with the list
{Jack Green,Nicole Smith}
? I am hoping there is some elegant way to do it.

input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >Jack Green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

(* Split on the tags and extract on the name tag positions *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]
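
For comparison, the same list can probably be obtained with a single StringCases pattern. This is a different technique from the split-and-extract approach above, not a restatement of it, and it is only a sketch: it assumes every name cell is written exactly as <td class="name"> followed (optionally after whitespace) by one <a ...> link, as in the sample. The variable name names is introduced here for illustration.

```mathematica
(* Sketch, assuming each name sits inside <td class="name"><a ...>text</a>. *)
(* 'input' is the sample string defined above; 'names' is an illustrative name. *)
names = StringCases[input,
  "<td class=\"name\">" ~~ (WhitespaceCharacter ...) ~~
    "<a" ~~ (Except[">"] ...) ~~ ">" ~~
    (name : Except["<"] ..) ~~ "</a>" :> name]
(* {"Jack Green", "Nicole Smith"} *)
```

Unlike the position-based version, this does not depend on how many tags or empty substrings precede the name, but it breaks as soon as the markup inside the cell differs from the assumed shape.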
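
This is great. I wonder whether it would be better wrapped in a function. Let me test it on the HTML source of some web pages. Thanks for the string manipulation tricks.

So I went on to test it, and one of the tests was:
input = Import["http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&division=1&region=0&regional=6&numberperpage=60&userid=0&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&athletename=&scaled=0", "Source"]
The output does not seem to be generated correctly; one of the tags appears to be the problem.

I have fixed the code. It now works on the CrossFit page. There are two changes: the addition of a start character, and an increment of the extract position (in the last line).

The start character is needed because StringSplit drops empty substrings at the beginning of its result, so without it the number of leading tags does not show up in the positions:

html = "aa1aaa2aa";
splits = StringSplit[html, "a"]
{1,2}

html = "aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
{1,2}

html = "0aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
{0,1,2}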
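
On the suggestion of wrapping this in a function, a minimal sketch might look like the following. The function name extractAfterTag and its key argument are hypothetical, introduced here only for illustration; the body simply repeats the steps of the answer's code. It keeps the answer's assumption that the wanted text is the second substring after the matching tag (that is, the matching tag is immediately followed by one more tag such as <a ...>).

```mathematica
(* Hypothetical wrapper around the steps above; the name and arguments are assumptions. *)
extractAfterTag[input_String, key_String] :=
 Module[{tagPattern, html, tags, positions, splits},
  tagPattern = "<" ~~ Except[">"] .. ~~ ">";
  (* Remove layout whitespace and add a start character, as in the answer *)
  html = StringJoin["X", StringReplace[StringTrim[input],
     {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];
  (* Positions of the tags whose text contains the key, ignoring case *)
  tags = StringCases[html, tagPattern];
  positions = Position[
    StringMatchQ[ToLowerCase /@ tags, "*" <> ToLowerCase[key] <> "*"], True];
  (* Split on the tags; keep only offsets that fall inside the split list *)
  splits = StringSplit[html, tagPattern];
  Extract[splits, Select[positions + 2, First[#] <= Length[splits] &]]]

extractAfterTag[input, "name"]
```

Note that the match is a plain substring test on the whole tag, so any tag that merely contains the key somewhere (for example inside a URL parameter such as athletename) would also be picked up. On real pages a stricter test on the class attribute is likely to be needed.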