R: Extracting comments from HTML code using XPath
I am trying to find out what is written inside the comment in the following HTML snippet, which is only part of the page's code:
<table id="datalist1" cellspacing="0" border="0" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<tr>
<td style="font-size:7pt;">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr align="left">
<td width="50%" class="subhead1">
<!-- <b>IE CODE : 0514026049</b> --> ' I want text inside this comment
</td>
<td rowspan="9" valign="top">
<span id="datalist1_ctl00_lbl_p"></span>
</td>
</tr>
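I am trying the following approach:
1. Get the XPath of the element
2. Read the web page
3. Go to the comment node
4. Extract the text inside the comment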
library(rvest)
library(xml2)
url <- 'http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z'
webpage <- read_html(url)
# XPath of the comment element I want to grab:
# //*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()
webpage %>%
  html_nodes(xpath='//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()') %>%
  html_text()
# character(0)   <- this is the output
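But the code above returns an empty result, character(0). Since I have never used XPath before, I don't know whether this is the right approach.
I have to run this for all of the comment elements.
In short, I think my question is: how do I extract comments from HTML code?
This may help you: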
library(magrittr) # extract2() comes from magrittr; rvest only re-exports %>%
webpage %>%
  html_nodes(xpath='//*[@id="datalist1"]') %>%
  extract2(1) %>% html_nodes("tr") %>%
  extract2(1) %>% html_nodes("td") %>%
  extract2(2) %>% html_nodes(xpath = '//comment()') %>% extract2(15) %>% html_text()
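One detail in the pipeline above is worth illustrating: an XPath that starts with // is absolute, so it searches the whole document no matter which node it is applied to, which is presumably why the 15th match has to be picked out by position. A minimal sketch on a made-up two-cell snippet (the ids a and b are hypothetical, not taken from the real page) compares it with the relative form .//comment():

```r
library(xml2)

# Hypothetical snippet: two cells, each holding one comment
doc <- read_html('<table><tr>
  <td id="a"><!-- first --></td>
  <td id="b"><!-- second --></td>
</tr></table>')

cell_b <- xml_find_first(doc, "//td[@id='b']")

# Absolute path: ignores the context node and scans the whole document
all_comments <- xml_text(xml_find_all(cell_b, "//comment()"))

# Relative path: only comments underneath the context node
own_comments <- xml_text(xml_find_all(cell_b, ".//comment()"))

length(all_comments)
trimws(own_comments)
```

With the relative form, the comment belonging to a given cell can be selected directly instead of by its position among all comments on the page.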
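Try removing tbody from the XPath, i.e. /table/tbody/tr[1] -> /table//tr[1], because it may have been added to the DOM by the browser...
Now that you are looking for an XPath solution, you may want to check again.
Yes! When I checked the site's source code, tbody was not there. I will try it without tbody.
Do you just want all of the comments in the HTML document, or is there a specific rule that identifies the comments you want? It's hard to tell from your example.
I want all of the comments that have tags.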
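The tbody point from the comments can be reproduced offline. The sketch below assumes a trimmed-down, hypothetical copy of the page markup (same id and nesting, most attributes dropped) and runs the question's XPath with and without the tbody steps:

```r
library(xml2)

# Hypothetical raw source shaped like the page: no <tbody> in the markup
doc <- read_html('<table id="datalist1"><tr><td>
  <table><tr><td class="subhead1"><!-- <b>IE CODE : 0514026049</b> --></td></tr></table>
</td></tr></table>')

# XPath as copied from the browser inspector, including tbody
with_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()'
)

# Same path with the tbody steps removed
without_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tr[1]/td/table/tr[1]/td[1]/comment()'
)

length(with_tbody)               # no match: the parsed source has no tbody
trimws(xml_text(without_tbody))  # the comment text, tags included
```

The parser used by read_html() does not appear to insert tbody the way browsers do, so an XPath copied from the inspector may need those steps removed before it matches.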
library(rvest)
library(tidyverse)
library(stringi) # stri_replace_all_regex() is not attached by tidyverse

pg <- read_html("http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z")

# target the comment, then back up to the enclosing table
html_nodes(pg, xpath=".//comment()[contains(., 'IE CODE')]/../../..") %>%
  map_df(~{
    # extract the <td> (column 1)
    html_nodes(.x, xpath=".//td[1]") %>%
      html_text(trim=TRUE) %>%
      str_replace_all("[[:space:]]+", " ") -> tmp

    # add in the comment to the "missing" <td> value
    html_node(.x, xpath=".//comment()") %>%
      html_text() %>%
      stri_replace_all_regex("<b>|</b>", "") -> tmp[1]

    # set it up for data frame-ing
    set_names(as.list(tmp), sprintf("X%s", 1:8))
  })
## # A tibble: 196 x 8
## X1 X2 X3
## <chr> <chr> <chr>
## 1 IE CODE : 0514026049 Z A M PRODUCTS 54 DAROOD GRAN SHAHPEER GATE MEERUT
## 2 IE CODE : AQDPV0923E Z CONNECT H-302, AIRFORCE NAVAL, ATHIPALAYAM PIRIVU, GANAPATHY, COIMBATORE
## 3 IE CODE : 2912000459 Z K INTERNATIONAL MUGHALPURA IST NEAR ISMAIL BEG KI MASJID MORADABAD
## 4 IE CODE : 0307069753 Z K R INTERNATIONAL CO. 4084, PLAZA SHOPPING CENTRE,104/142, SHERIF DEVJI STREET, MUMBAI,
## 5 IE CODE : 3117507531 Z S ENTERPRISES SURVEY NO 12,PLOT NO.64,FLAT NO 1, KAUSARBAUGH NIBM ROAD KONDHWA KHURD PUNE
## 6 IE CODE : 0500009503 Z. EXPORTS T-283, NEAR GURUDWARA BHAIJI B AHATA KIDARA,
## 7 IE CODE : 0713030658 Z. K. MANGO MANDI APMC YARD, RMC CHANNAPATNA, RAMANAGARA DISTRICT
## 8 IE CODE : 0599037351 Z.A. CRAFTS, A-56, GALI NO. 6, CHOUHAN BANGER, NEW SEELAM PUR, DELHI
## 9 IE CODE : 0609001353 Z.B.INTERNATIONAL 1ST FLOOR,25TH MILE STONE,AGRA MATHURA ROAD,VILL CHUMURA, POST-FARAH MATHURA
## 10 IE CODE : 0501009256 Z.D. EXPORTS J-51, EXTENSION, STREET NO. 12/3, RAMESH PARK, LAXMI NAGAR DELHI
## # ... with 186 more rows, and 5 more variables: X4 <chr>, X5 <chr>, X6 <chr>, X7 <chr>, X8 <chr>
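If only the comment text is needed, without the rest of the table, a smaller variation is possible. The sketch below is not the method used in the answer above: instead of deleting the b tags with a regular expression, it re-parses each comment body as HTML and takes its text. The inline snippet is a hypothetical stand-in for the real page, and the '<b>' filter is an assumption about what "comments that have tags" means.

```r
library(xml2)

# Hypothetical snippet: one comment holding markup, one plain comment
doc <- read_html('<div>
  <!-- <b>IE CODE : 0514026049</b> -->
  <!-- a plain note without markup -->
</div>')

# Keep only the comments whose text contains a <b> tag
tagged <- xml_find_all(doc, "//comment()[contains(., '<b>')]")

# Re-parse each comment body as HTML and take its text, which drops the tags
codes <- vapply(
  xml_text(tagged),
  function(txt) trimws(xml_text(read_html(txt))),
  character(1),
  USE.NAMES = FALSE
)

codes
```

Re-parsing copes with any tag that appears inside a comment, whereas the regular expression only removes the tags it names. The trade-off is one extra parse per comment.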