R - Scraping nodes from the pages of a website - How to insert the individual page links in the final data frame


r web-scraping hyperlink


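Dear Stack Overflow users, I am trying to scrape two nodes from different pages of a website (Psychology Today; the pages refer to mental health professionals, MHPs).

First I create a scraping function, then a loop that calls that function. In the end I am able to build a data frame. However, I would like to include, as a third variable, the full link to each individual page I scraped. How can I include this information in the data frame?

Here is the loop: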

library(rvest) # read_html()
library(plyr)  # rbind.fill()

# getProfile() is the scraping function created earlier (not shown here)
j <- 1 #set the running variable to 1 (the MHP id will increase by one)
MHP_codes <- c(150130:150170) #therapist identifier range
df_list <- vector(mode = "list", length(MHP_codes)) #set up the list
                                                    #that collects individual
                                                    #MHP information
for(code1 in MHP_codes) {
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  #Reading the HTML code from the website
  URL <- tryCatch(read_html(URL),
                  error = function(e) force(do.next))
                #tryCatch also catches pages with a missing URL;
                #however, if an error occurs on those pages
                #the loop stops; this is why we need delayedAssign
                #and force(do.next) in tryCatch
  df_list[[j]] <- getProfile(URL) #the function puts the scraped data
                                  #into a row
  na.omit(df_list) #meant to eliminate rows with only NAs, which happens if
                   #the page has no profile; note the result is not assigned,
                   #so df_list is unchanged as written
  j <- j + 1
}
final_df <- rbind.fill(df_list) #gather the rows into one data set

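Either option has its pros and cons. If you are sure there is a one-to-one relationship between the URLs and the rows in the data frame (i.e., no missing data), you can simply cbind a vector of the URLs to final_df: cbind(final_df, paste0('https://www.psychologytoday.com/us/therapists/illinois/', MHP_codes))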
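A minimal sketch of the cbind suggestion, under the one-to-one assumption stated in the comment. The small final_df and its column names (name, title) are made-up stand-ins for the scraped data, and the row-count check is an added safeguard that is not part of the original suggestion.

```r
# Hypothetical stand-in for the scraped result: one row per code, none dropped
MHP_codes <- 150130:150132
final_df <- data.frame(name  = c("A", "B", "C"),
                       title = c("LCSW", "PhD", "LPC"))

base_url <- "https://www.psychologytoday.com/us/therapists/illinois/"
urls <- paste0(base_url, MHP_codes)  # one URL per code, in the same order

# Only valid if every code produced exactly one row
stopifnot(nrow(final_df) == length(urls))

final_df <- cbind(final_df, link = urls)
```

If any page is skipped or removed along the way, the row count no longer matches the number of codes and the check fails instead of silently attaching links to the wrong rows.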
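Very interesting. I think it should work even if the URL does not exist. Actually, I should be more precise: all the URLs exist, but many of them have nothing to scrape (i.e., "code1" is not associated with any MHP; in this example: )

@DaveT If I intervene in the loop, should I write it below na.omit(), something like: cbind(df_list, paste0("", code1))? If I insert it before na.omit(), rows that would otherwise be filled entirely with NAs will now have a variable with a non-NA value, so these useless observations will not be removed.

Yes, if some URL returns no data, my previous comment will not work. If you intervene in the loop, then something like: df_list[[j]]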
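The last comment is cut off, so the commenter's actual code is unknown. One way the in-loop idea could look is sketched below, assuming the link is attached only after checking that the scraped row is not entirely NA. fake_profile() is a hypothetical stand-in for read_html() plus getProfile() so the example runs without network access, and its columns (name, title) are invented for illustration.

```r
base_url  <- "https://www.psychologytoday.com/us/therapists/illinois/"
MHP_codes <- 150130:150133

# Hypothetical stand-in for read_html() + getProfile(): returns a one-row
# data frame, filled with NAs when the code has no profile behind it.
fake_profile <- function(code) {
  if (code %% 2 == 0) {
    data.frame(name = paste("Therapist", code), title = "LCSW",
               stringsAsFactors = FALSE)
  } else {
    data.frame(name = NA_character_, title = NA_character_,
               stringsAsFactors = FALSE)
  }
}

df_list <- vector("list", length(MHP_codes))
for (j in seq_along(MHP_codes)) {
  code1 <- MHP_codes[j]
  row <- fake_profile(code1)
  if (all(is.na(row))) next            # empty page: leave this slot NULL
  row$link <- paste0(base_url, code1)  # add the link only to non-empty rows
  df_list[[j]] <- row
}

# Drop the NULL slots, then stack the remaining rows
final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
```

Because the emptiness check runs before the link column is added, all-NA rows are still dropped, which addresses the concern raised in the second comment.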