XML, R: Strange characters when getting table data with xpathSApply

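I am using the XML package in R to scrape a table. However, the scraped result contains some strange characters. Here is the script that gets the table: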

TUrlhtml <- htmlParse(htmlfile)
TUrlTable <- xpathSApply(TUrlhtml, 
                          "//table[@class='results_table']/tr/td",
                          xmlValue)
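However, when I view the HTML file in Internet Explorer or Firefox, there is absolutely nothing wrong with the file itself. How should I fix this? Thank you very much for your advice!

Here is my HTML file: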

<table id="main_table" class="results_table" bgcolor="white" cellspacing="0" cellpadding="0" border="0"><thead style="background-color: white;"><tr class="top_row" style="padding-bottom: 3px; color: #0033cc"><th style="vertical-align:bottom">
<!--si--><link rel="stylesheet" href="/classes/user_interface/vertical_menu_panel@12.41.004@.css" type="text/css" >      <div class="yui-skin-sam">
      <div class="vertical_menu_panel" style="display:none">
        <div class="hd" style="display:none"></div>
        <div class="bd">
          <ul> <li class="primary_sort_ascending"><img class="primary_sort_ascending" src="/images/primary_sort_ascending.gif"/>&nbsp;Sort Asc<span class="action" style="display:none">primary_sort_ascending</span></li> <li class="primary_sort_descending"><img class="primary_sort_descending" src="/images/primary_sort_descending.gif"/>&nbsp;Sort Desc<span class="action" style="display:none">primary_sort_descending</span></li> <li class="add_column"><img class="add_column" src="/images/add_column_left.gif"/>&nbsp;Add Column Here<span class="action" style="display:none">add_column</span></li> <li class="remove_column"><img class="remove_column" src="/images/subtract_small.gif"/>&nbsp;Remove Column<span class="action" style="display:none">remove_column</span></li>          </ul>
        </div>
        <div class="ft" style="display:none"></div>
      </div>
      </div>
<input style="" type="checkbox" id="master_check_box" title="Select All">
</th>      <td width="" valign="middle" align="center" nowrap><img alt="" border=0 src="/images/shim.gif" width="1" height="1"></td>
  <td width="75" valign="bottom" nowrap class="txtScreen column_header">&nbsp;<span style="text-decoration: underline;"><br><br><br>Ticker</span><IMG alt="" border="0" height="8" src="/images/b_sort_flat_rv.gif" width="10">&nbsp;  <img class="menu_icon" style="position:absolute; vertical-align:bottom; display:none" border="0" src="/images/arrow_down_button.gif"/>   <span class="attributes" style="display: none"><span class="column_id">0</span><span class="display_name">Ticker</span></span></td>
                          <td valign="bottom" align="left" nowrap class="txtScreen column_header  "><span href="#" onclick="return false"><span style="text-decoration: underline;">Company Name</span>                           <img class="menu_icon" style="position:absolute; vertical-align:bottom; display:none" border="0" src="/images/arrow_down_button.gif"/> <span class="attributes" style="display: none"><span class="column_id">2886069004</span><span class="display_name">Company Name</span></span></td>
                          <td valign="bottom" align="center" nowrap class="txtScreen column_header  "><span href="#" onclick="return false"><span style="text-decoration: underline;">Last<br>Quarter<br>Date</span>                           <img class="menu_icon" style="position:absolute; vertical-align:bottom; display:none" border="0" src="/images/arrow_down_button.gif"/> <span class="attributes" style="display: none"><span class="column_id">2886070004</span><span class="display_name">Last Quarter Date</span></span></td>
                          <td valign="bottom" align="center" nowrap class="txtScreen column_header  "><span href="#" onclick="return false"><span style="text-decoration: underline;">Price</span>                           <img class="menu_icon" style="position:absolute; vertical-align:bottom; display:none" border="0" src="/images/arrow_down_button.gif"/> <span class="attributes" style="display: none"><span class="column_id">2886071004</span><span class="display_name">Price</span></span></td>
                          <td valign="bottom" align="center" nowrap class="txtScreen column_header  "><span href="#" onclick="return false"><span style="text-decoration: underline;">Date of<br>Last Report</span>                           <img class="menu_icon" style="position:absolute; vertical-align:bottom; display:none" border="0" src="/images/arrow_down_button.gif"/> <span class="attributes" style="display: none"><span class="column_id">2886072004</span><span class="display_name">Date of Last Report</span></span></td>
        <td width="100%" class="txtScreen">&nbsp;</td></tr>
      </thead>
        <tr bgcolor="#eeeeee">          <th style="height: 15px">
          <input class="select_cb" type="checkbox" style="height: 13px">
          <span class="attributes" style="display: none"><span class="security_id">28411</span></span>
          </th>
                <td valign="middle" align="center"></td>



        <td align="left" nowrap class="txtScreen">&nbsp;<a href="/stocks/stocks.phtml?security_id=28411&ticker=2+HK" target="_parent" >2 HK</a></td>
        <td align="left" nowrap class="txtScreen" title="Company Name">Clp Holdings Limited&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Last&#13;Quarter&#13;Date">&nbsp;&nbsp;Dec-00&nbsp;&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Price">66.60&nbsp;&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Date of&#13;Last Report">&nbsp;&nbsp;26-Feb-12&nbsp;&nbsp;&nbsp;</td>
<td bgcolor="#ffffff">&nbsp;</td> </tr>
        <tr bgcolor="#ffffff">          <th style="height: 15px">
          <input class="select_cb" type="checkbox" style="height: 13px">
          <span class="attributes" style="display: none"><span class="security_id">48569</span></span>
          </th>
                <td valign="middle" align="center"></td>



        <td align="left" nowrap class="txtScreen">&nbsp;<a href="/stocks/stocks.phtml?security_id=48569&ticker=3+HK" target="_parent" >3 HK</a></td>
        <td align="left" nowrap class="txtScreen" title="Company Name">Hong Kong & China Gas Co&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Last&#13;Quarter&#13;Date">&nbsp;&nbsp;Jun-03&nbsp;&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Price">21.40&nbsp;&nbsp;&nbsp;</td>
        <td align="right" nowrap class="txtScreen" title="Date of&#13;Last Report">&nbsp;&nbsp;19-Mar-12&nbsp;&nbsp;&nbsp;</td>
<td bgcolor="#ffffff">&nbsp;</td> </tr>
</table>

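This is caused by the non-breaking spaces (&nbsp;) in the HTML.

A non-breaking space is Unicode character 0xA0, which in UTF-8 is encoded as \xC2\xA0.

I don't know why it also inserts EE7B in between.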
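
As a quick check of the explanation above, the following base R sketch shows the bytes behind U+00A0 and one possible way to strip it from cell text that has already been extracted. The helper name clean_nbsp and the sample strings are assumptions made for illustration and are not part of the original thread; the strings only mimic what xmlValue would return for a few cells of the table in the question.

```r
# U+00A0 (non-breaking space) is stored as two bytes in UTF-8.
print(charToRaw("\u00a0"))   # c2 a0

# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# then trim leading and trailing whitespace.
clean_nbsp <- function(x) {
  x <- enc2utf8(x)                            # make sure strings are UTF-8
  x <- gsub("\u00a0", " ", x, fixed = TRUE)   # NBSP -> regular space
  trimws(x)
}

# Sample strings standing in for the output of xpathSApply(..., xmlValue).
cells <- c("\u00a02 HK",
           "Clp Holdings Limited\u00a0\u00a0",
           "\u00a0\u00a0Dec-00\u00a0\u00a0\u00a0",
           "66.60\u00a0\u00a0\u00a0")

cleaned <- clean_nbsp(cells)
print(cleaned)
```

In the question's script, the same helper could presumably be applied to TUrlTable right after the xpathSApply call, which leaves the parsed document itself untouched.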

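Comments:

In your XML example there is no table with class "results table". Maybe you mean
[@class='results_table']
?

@agstudy: Yes, you are right, I meant [@class='results_table']; sorry for the typo. Do you have any idea how to solve this?

I'm very familiar with HTML ;) I want to get the table values (without any formatting), including the headers in the table.

Please try this:
as.data.frame(readHTMLTable(TurlHtml))

@agstudy: Thanks, I have already tried readHTMLTable(TurlHtml), but I still get the strange characters, and when I try as.data.frame I get "Error in make.names(vnames, unique = TRUE) : invalid multibyte string 1". Thanks for the pointer; do you have any solution? Interestingly, when I open the HTML in Internet Explorer or Firefox, the text in the table displays without any problem, so I suppose there should be some trick in R to handle this?

You could use a regular expression to replace all the &nbsp; characters with space characters before parsing the HTML.

Thanks for the suggestion. Just wondering why Internet Explorer and Firefox can handle this, but R cannot?

The EE7B is probably just a bug in R's entity decoder (it is inconsistent and makes no sense: sometimes it is inserted, sometimes not), but \xC2\xA0 is actually the correct way for the parser to handle it. When displayed on screen, it should appear as a single space; it is just awkward to work with. If the parser replaced it with a space, you would no longer be able to save the file back as HTML unchanged.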
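
The suggestion to replace the entity before parsing could look roughly like the sketch below. It is only an illustration under stated assumptions: the inline html string is a small stand-in for the file in the question, the variable names are made up, and the parsing step runs only if the XML package is installed. As the last comment notes, this changes the markup, so it suits extraction rather than round-tripping the file.

```r
# Small inline stand-in for the HTML file from the question.
html <- paste0(
  "<table class='results_table'><tr>",
  "<td>&nbsp;2 HK</td>",
  "<td>Clp Holdings Limited&nbsp;&nbsp;</td>",
  "<td>66.60&nbsp;&nbsp;&nbsp;</td>",
  "</tr></table>"
)

# Replace the named entity (and its decimal form) with a plain space
# before the parser ever sees it.
html_clean <- gsub("&nbsp;|&#160;", " ", html)

# Parse the cleaned text only when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(html_clean, asText = TRUE)
  cells <- XML::xpathSApply(doc,
                            "//table[@class='results_table']//td",
                            XML::xmlValue)
  print(trimws(cells))
}
```

For a file on disk, the text could be read with readLines and collapsed with paste before the gsub step; the encoding of that file is something to confirm rather than assume.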