R: How do I properly store a list of HTML files for web scraping?


r xml list web-scraping


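Using the rvest package, I am scraping information from recipe pages (e.g. title, number of reviews, average rating, cooking time), and that part works fine.

The problem, however, is storing the list object that holds the recipe pages and reloading it into R later (to avoid spamming the site with requests again).

When I save the recipes_html object (see the example code below), e.g. as an .rds file, and then reload it, I get the following error: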

 recipes_html
"Error in doc_is_html(x$doc) : external pointer is not valid"

whereas it should look like this:

 recipes_html[[1]]
{html_document}
<html lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8">\n<meta charset="UTF-8">\n<script type="tex ...
[2] <body class="post-template-default single single-post postid-42142 
single-format-standard woocommerce-no-js header-ful ...

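So what is a suitable way to store the recipes_html list object so that I can simply reload it later and continue scraping? Preferably without splitting the list and storing each page separately, since the functions I built to extract information in the next steps loop over the list of pages and work very well.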

Reproducible code:

    library(rvest)

    example_links <- read_html("https://minimalistbaker.com/recipe-index/") %>%
      html_nodes(".entry-title a") %>%
      html_attr("href") %>%
      head()

    # nytnyt function to introduce a random waiting time between each request
    nytnyt <- function(periods = c(1, 2.5)) {
      tictoc <- runif(1, periods[1], periods[2])
      cat(paste0(Sys.time()), "- Sleeping for ", round(tictoc, 2), "seconds\n")
      Sys.sleep(tictoc)
    }

    recipes_html <- list()  # save the list to

    for (i in seq_along(example_links)) {
      recipes_html[[i]] <- read_html(example_links[i])
      print(i)
      nytnyt(periods = c(2, 4))  # make R wait 2 to 4 seconds between requests
    }

    saveRDS(recipes_html, "example_recipes.rds")

    readRDS("example_recipes.rds")  # saving and loading the object again won't work

Comment: Why not parse the files and build a data frame to store the information from the scraped data?

Passing the list to write_html() gives:

    Error in UseMethod("write_html") : no applicable method for 'write_html' applied to an object of class "list"

    class(recipes_html)
    [1] "list"
    class(recipes_html[[1]])
    [1] "xml_document" "xml_node"
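One possible approach, offered as a sketch rather than a definitive answer: an xml_document wraps an external pointer to memory managed by libxml2, and that pointer is not preserved by saveRDS(), which would explain the "external pointer is not valid" error. Converting each page to plain text before saving, and re-parsing after loading, keeps everything in a single list. The example below uses two small inline HTML strings as stand-ins for the scraped pages, so it runs without network access. The file path and the names recipes_txt, recipes_restored and titles are made up for illustration.

```r
library(xml2)  # rvest re-exports read_html() from xml2

# Stand-in for the scraped pages (assumption: real pages would come from
# read_html(url) as in the question).
recipes_html <- list(
  read_html("<html><head><title>Recipe A</title></head><body><h1>A</h1></body></html>"),
  read_html("<html><head><title>Recipe B</title></head><body><h1>B</h1></body></html>")
)

# Serialise every document to a character string; plain strings survive saveRDS().
recipes_txt <- lapply(recipes_html, as.character)

rds_path <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(recipes_txt, rds_path)

# In a later session: load the strings and parse them back into documents.
recipes_restored <- lapply(readRDS(rds_path), read_html)

# The restored list can be looped over like the original one.
titles <- vapply(
  recipes_restored,
  function(doc) xml_text(xml_find_first(doc, "//title")),
  character(1)
)
print(titles)
```

The restored object is again a list of xml_document objects, so extraction functions that loop over the list should not need changes. If one file per page is acceptable after all, write_html() can be applied to each list element individually instead of to the list as a whole.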