Web scraping: Any advice on getting started scraping an e-commerce website with rvest?

web-scraping e-commerce rvest

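I am trying to scrape some data from an e-commerce website with rvest. I haven't found any good examples to guide me. Do you know of any?

Here is how I started, as an example: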

library(rvest)
library(purrr)

#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Reading the HTML code from the website
webpage <- read_html(url_base)

#Using CSS selectors to scrape the titles section
title_html <- html_nodes(webpage,'.main-title')
#Converting the title data to text
title <- html_text(title_html)
head(title)

#Using CSS selectors to scrape the price section
price <- html_nodes(webpage,'.item__price')
price <- html_text(price)
price

Scraping that information is not difficult, and it is doable with rvest.

What you need to do is get all the hrefs and loop over them. To do that, you need html_attr().

The following code should do the job:
library(tidyverse)
library(rvest)

#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages[1] <- url_base
#create an empty table to store results
result_table <- tibble() 
for(page in all_pages){
    page_source <- read_html(page)
    title <- html_nodes(page_source,'.item__info-title') %>% html_text()
    price <- html_nodes(page_source,'.item__price') %>% html_text()
    item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
    temp_table <- tibble(title = title, price = price, item_link = item_link)
    result_table <- bind_rows(result_table,temp_table)
}
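
To see what html_attr("href") returns without hitting the live site, here is a minimal offline sketch. The HTML snippet and the example.com URLs are made up for illustration; only the ".pagination__page > a" selector is taken from the answer above, and the real page markup may differ.

```r
library(rvest)

# Hypothetical markup imitating a pagination bar; not the real site's HTML.
snippet <- '<html><body><ul>
  <li class="pagination__page"><a href="https://example.com/list">1</a></li>
  <li class="pagination__page"><a href="https://example.com/list_Desde_51">2</a></li>
</ul></body></html>'

# read_html() also accepts a literal HTML string, so no request is made here.
page_links <- read_html(snippet) %>%
  html_nodes(".pagination__page > a") %>%  # every link inside a pagination item
  html_attr("href")                         # keep only the href attribute

print(page_links)
```

The result is a plain character vector with one URL per matched link, which is why it can be passed straight to a for loop.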

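Comments:

You need to give us a minimal example. Since e-commerce websites usually have a lot of dynamic content, the usual workflow requires RSelenium together with rvest. You could start by reading RSelenium and rvest tutorials to learn how to scrape websites.

Thanks Yifu, I have just added some simple examples. Meanwhile, I will look into Selenium; I didn't know about it...

Thanks again Yifu!! It is working... I still have two questions: 1. Why are there only 510 observations when there are many more items? 2. How can I get more information from inside each item, such as location, accepted payment methods and, if possible, even the product description?

You can observe that each page has 50 items and that there is a pattern in the URL suffixes. I got the first 10 pages. Once you have the link for each item, you can easily loop over the item links with read_html() and scrape inside each item page.

Thanks again! I see that this website has an API. Any advice on rvest and SelectorGadget when working with an API?

Working with an API is completely different, though; the approach is not the same at all. You can look at tutorials for the httr package, and you can also search for other tutorials. If you want to stick with rvest, I have updated the original answer; I hope it helps.

The remove_nt() function only tidies the text formatting. Sorry, I forgot to include it in the script. You can safely remove that function.

Since each page has 50 items and the URL suffixes follow a pattern, we can generate the page URLs like this:
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",seq.int(from = 51,by = 50,length.out = 40))

Scraping each item page
Let's take this page as an example: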

pagesource <- read_html("https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM")
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description, 
       product_price = product_price,
       n_vendor = n_vendor,
       n_opinion = n_opinion)
print(current_table)
# A tibble: 1 x 4
  product_description                               product_price n_vendor   n_opinion
  <chr>                                             <chr>         <chr>      <chr>    
1 Protector Funda Clear Cover Samsung Galaxy Note 8 14            14vendidos 2        
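
The snippets in this answer call remove_nt(), which is never defined in the thread; the author only describes it as a text-formatting helper. A minimal stand-in, assuming its job is to strip newlines and tabs and trim the result (the body below is a guess, not the author's original function):

```r
# Hypothetical stand-in for the missing remove_nt() helper.
# Assumption: it removes newlines/tabs and surrounding whitespace.
remove_nt <- function(x) {
  x <- gsub("[\r\n\t]+", " ", x)  # replace runs of newlines/tabs with a space
  x <- gsub(" {2,}", " ", x)      # collapse repeated spaces
  trimws(x)                       # drop leading and trailing whitespace
}

remove_nt("\n\t14\n\tvendidos\n")
```

Defining it before running the scraping code lets the pipelines run unchanged; alternatively, drop the `%>% remove_nt()` step altogether.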

library(tidyverse)
library(rvest)

#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Build the page URLs from the "_Desde_" suffix pattern (50 items per page)
all_pages <- c(url_base,
               str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",
                     seq.int(from = 51,by = 50,length.out = 40)))
#create an empty table to store results
result_table <- tibble() 
for(page in all_pages[1:5]){ #as an example, only scrape the first 5 pages
    page_source <- read_html(page)
    title <- html_nodes(page_source,'.item__info-title') %>% html_text()
    price <- html_nodes(page_source,'.item__price') %>% html_text()
    item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
    temp_table <- tibble(title = title, price = price, item_link = item_link)
    result_table <- bind_rows(result_table,temp_table)
}

#loop on result table(item_link):
product_table <- tibble()
for(i in seq_len(nrow(result_table))){ #seq_len() also handles an empty result_table
    pagesource <- read_html(result_table[[i,"item_link"]])
    n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
    product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
    currency_symbol <- pagesource %>% html_node(".price-tag-symbol") %>% html_text()
    n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
    product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
    current_table <- tibble(product_description = product_description, 
                            currency_symbol = currency_symbol,
                            product_price = product_price,
                            n_vendor = n_vendor,
                            n_opinion = n_opinion,
                            item_link = result_table[[i,"item_link"]])
    product_table <- bind_rows(product_table,current_table)
}
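
When looping over hundreds of item links, a single page that fails to load stops the whole loop. One possible way to guard against that is sketched below; safe_read_html() and its pause argument are hypothetical names introduced here and are not part of the original answer. The demonstration runs offline on a literal HTML string and a missing file.

```r
library(rvest)

# Hypothetical helper: returns NULL instead of stopping the loop
# when a page cannot be read or parsed.
safe_read_html <- function(link, pause = 0) {
  Sys.sleep(pause)  # optional delay between requests, in seconds
  tryCatch(read_html(link), error = function(e) NULL)
}

# Offline demonstration: a literal HTML string parses, a missing file does not.
ok  <- safe_read_html("<html><body><p class='price-tag-fraction'>14</p></body></html>")
bad <- safe_read_html("this_file_does_not_exist.html")

ok %>% html_node(".price-tag-fraction") %>% html_text()
is.null(bad)
```

Inside the item loop this could replace the direct read_html() call, followed by `if (is.null(pagesource)) next` to skip pages that failed.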