Scraping with rvest: fill in NAs when a tag is not present


r tags web-scraping


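I want to parse this HTML and get the following elements from it:

a) the p tag with class "normal_encontrado"
b) the div with class "price"

Sometimes the p tag is not present for a product. In that case, an NA should be added to the vector that collects the text from those nodes.

The idea is to end up with two vectors of the same length and then join them into a data.frame. Any ideas?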

The HTML part:

<html>
<head></head>
<body>

<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>

  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>    
</div>

<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body>
</html>




R code:

library(rvest)

page_source <- read_html("r.html")

r.precio.antes <- page_source %>%
html_nodes(".normal_encontrado") %>%
html_text()

r.precio.actual <- page_source %>%
html_nodes(".price") %>%
html_text()


This may not be the most idiomatic way, but you can use lapply over the .product_price nodes, like this:

r.precio.antes <- page_source %>% html_nodes(".product_price") %>%
  lapply(. %>% html_nodes(".normal_encontrado") %>% html_text() %>% 
     ifelse(identical(., character(0)), NA, .)) %>% unlist

To make the code clearer, I would first isolate the .product_price nodes:

product_nodes <- page_source %>% html_nodes(".product_price")

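Rather than writing an anonymous function, I used magrittr syntax for the function passed to lapply: a pipeline that begins with . defines a function of one argument (see the functional sequence examples in the magrittr documentation).

The last hurdle is that when the element is not found, character(0) is returned rather than the NA you wanted. So I added ifelse(identical(., character(0)), NA, .) to the pipeline inside the lapply to handle that case.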
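As a minimal sketch of the two ideas above, using only magrittr and made-up inputs (the name na_if_empty is hypothetical, not part of any package): a pipeline starting with . behaves as an ordinary one-argument function, and the ifelse() step turns an empty result into NA.

```r
library(magrittr)

# A pipeline that starts with `.` is a function of one argument
# (a magrittr "functional sequence"). `na_if_empty` is an illustrative name.
na_if_empty <- . %>% ifelse(identical(., character(0)), NA, .)

# Stand-ins for what html_text() returns per product:
# no match for the first product, one match for the second.
per_product <- list(character(0), "S/. 2,799.00")

result <- unlist(lapply(per_product, na_if_empty))
print(result)
```

One caveat of this design: ifelse() returns a value shaped like its test, which has length one here, so if a product contained several matching tags only the first text would be kept.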

Go up one level from the target and lapply over each parent element:

library(xml2)
library(rvest)

pg <- read_html('<html>
<head></head>
<body>

<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>

  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>    
</div>

<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body>
</html>')

prod <- html_nodes(pg, "div.product_price")
do.call(rbind, lapply(prod, function(x) {
  # html_node() is used here; the xml_node() alias has been deprecated in rvest
  norm <- tryCatch(xml_text(html_node(x, "p.normal_encontrado")),
                   error=function(err) {NA})
  price <- tryCatch(xml_text(html_node(x, "div.price")),
                    error=function(err) {NA})
  data.frame(norm, price, stringsAsFactors=FALSE)
}))

##                     norm                  price
## 1 \n    S/. 2,799.00\n   \n    S/. 2,299.00\n  
## 2                   <NA> \n    S/. 4,999.00\n
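The same "one lookup per parent" idea can probably be written without lapply or tryCatch. The sketch below assumes rvest 1.0.0 or later, where html_element() (singular) returns exactly one result per input node and a missing node when nothing matches, so html_text() yields NA for it. The HTML is a trimmed copy of the question's example, and the trim = TRUE argument is an addition to strip the surrounding whitespace.

```r
library(rvest)

html <- '<html><body>
<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>
  <div class="price">
    S/. 2,299.00
  </div>
</div>
<div class="product_price" id="product_price_232046">
  <div class="price">
    S/. 4,999.00
  </div>
</div>
</body></html>'

# One node per product
products <- html_elements(read_html(html), "div.product_price")

# html_element() keeps one entry per product, so both columns
# have the same length even when the <p> tag is absent
prices <- data.frame(
  precio.antes  = html_text(html_element(products, "p.normal_encontrado"), trim = TRUE),
  precio.actual = html_text(html_element(products, "div.price"), trim = TRUE),
  stringsAsFactors = FALSE
)
print(prices)
```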

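Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function gets the values of the div and p child nodes. The resulting character matrix m is reworked into a data frame DF, and its columns are cleaned of every character that is not a dot or a digit, also removing any dot that is followed by a non-digit. The result is then converted to numeric. Note that no special handling is needed for the case where p is missing.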
# input

Lines <- '<html>
<head></head>
<body>

<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>

  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>    
</div>

<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body>
</html>'

# code to read input and produce a data.frame

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
  list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]])) })

DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean
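To see what the cleaning step does in isolation, here is a small base R sketch that applies the same regular expression to raw strings shaped like the scraped text (the vector raw is made up for illustration).

```r
# Raw text as it comes out of the nodes, plus a missing value
raw <- c("\n    S/. 2,799.00\n  ", "\n    S/. 4,999.00\n", NA)

# "[^.0-9]" drops anything that is not a dot or a digit;
# "[.]\\D" drops a dot followed by a non-digit (the dot in "S/. ")
cleaned <- gsub("[^.0-9]|[.]\\D", "", raw)

values <- as.numeric(cleaned)
print(values)
```

The decimal point survives because it is followed by a digit, while the thousands separator is removed as an ordinary non-digit character; NA inputs stay NA throughout.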

rvest returns character(0) when a tag is not found. So, assuming you find at most one current price and one regular price in each div.product_price, you can use this:
pacman::p_load("rvest", "dplyr")

get_prices <- function(node){
  r.precio.antes <- html_nodes(node, 'p.normal_encontrado') %>% html_text
  r.precio.actual <- html_nodes(node, 'div.price') %>% html_text

  data.frame(
    precio.antes = ifelse(length(r.precio.antes)==0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual)==0, NA, r.precio.actual), 
    stringsAsFactors=F
  )

}

doc <- read_html('test.html') %>% html_nodes("div.product_price")
# bind_rows() supersedes the deprecated rbind_all()
lapply(doc, get_prices) %>%
  bind_rows()


Edit: I misunderstood the input data, so I changed the script to work with just a single HTML page.
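Comments:

Something like this might help.

Would you mind explaining the code? Especially this part: lapply(. %>% html_nodes(".normal_encontrado"). Why is there a "." after lapply? And also: (function(x) ifelse(identical(x, character(0)), NA, x)). Thanks.

Actually, I realised you can just write ifelse(identical(., character(0)), NA, .) instead of the (function(x) ...) syntax. I have expanded the code and the explanation. Is that clearer?

Much clearer, thanks. I also like the Grothendieck approach, but I have never used the XML package.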
Result of the XML-based code above:

> DF
     p  div
1 2799 2299
2   NA 4999
