How to properly close connections so I won't get "Error in file(con, "r"): all connections are in use" when using "readLines" and "tryCatch"


r url


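I have a list of URLs (more than 4000) from one particular domain (pixilink.com), and I want to determine whether each URL points to a picture or a video. To do this, I used a solution provided elsewhere and wrote the following code: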

#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}

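So my guess as to why this is happening is that you are not closing the connections that you open via tryCatch() and, with readLines(), via urlmode(). I was not sure how urlmode() was going to be used, so I had simplified it as much as I could (in hindsight, that was badly done, and I apologize). I have therefore taken the liberty of rewriting urlmode() to try to make it a bit more robust for what appears to be a more extensive task at hand.

I think the comments in the code should help, so have a look below: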
#Updated URL mode function with better 
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  
  #Check if URL is good to go
  if(!httr::http_error(x)){
    
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"
    
    #Then since there are redirect shenanigans
    #Get the actual URL the input points to
    #It should just be the input URL if there is
    #no redirection
    #This is important as this also takes care of
    #checking whether http or https need to be prefixed
    #in case the input URL is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url
    
    #Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    
    mypos <- grep("initial_mode = ", mycontent)
    
    #Close the connection since it's no longer
    #necessary
    close(mycon)
    
    #Some URLs with weird formats can return 
    #empty on this one since they don't
    #follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually
    #redirected from "https://www.pixilink.com/3"
    #After that, evaluate what's at mypos, and always 
    #return the actual URL
    #along with the result
    if(!purrr::is_empty(mypos)){
      
      #mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)
      
      #So once all that is done, check if the line at mypos
      #contains a 0 (picture), tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
        #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
        #return("video")
      #}
      
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
    
  } else{
    #Straight up invalid URL
    #No myx variable to return here
    #Just x
    return(c(x, "invalid"))
    #return("invalid")
  }
  
}


#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)


#All future + progressr related stuff
#learned courtesy 
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)

#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))


#Website's base URL
baseurl <- "https://www.pixilink.com"

#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar

#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ##Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
  
})




#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")

#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "^0$", "picture")


head(sitetype)
#                        given_url                     actual_url        mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310     invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311     invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313     picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315        tour

unique(sitetype$mode)
# [1] "invalid"     "floorplan2d" "picture"     "tour" 

#--------
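
The connection handling in urlmode() above closes the connection only if open() and readLines() both succeed. A minimal sketch of one way to make the cleanup unconditional is to register close() with on.exit() before opening. The helper name get_initial_mode and its opener argument are hypothetical and not part of the original answer; opener exists only so that the same logic can be tried on a local file instead of a live URL.

```r
# Hypothetical helper (an assumption, not the original answer's code):
# read a page and extract the quoted value on the "initial_mode = " line.
# `opener` defaults to url(); passing file() allows an offline trial.
get_initial_mode <- function(description, opener = url) {
  con <- opener(description)
  # Register the cleanup before opening, so the connection is closed and
  # destroyed on every exit path, including errors in open() or readLines()
  on.exit(close(con), add = TRUE)
  open(con, "r")
  lines <- readLines(con, warn = FALSE)

  hit <- grep("initial_mode = ", lines, fixed = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Keep whatever sits between the first and last single quote on that line;
  # a line without quotes is returned unchanged
  sub("^[^']*'(.*)'[^']*$", "\\1", hit[1])
}
```

Because the cleanup is tied to the function exit rather than to a later line of code, a failed open() should no longer leave an unused connection behind, which is the kind of leak that can eventually lead to the "all connections are in use" error.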

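Why not close the connections after you have finished using them?

@IRTFM, I have already tried it with closeAllConnections(), but I still received the same error message.

Thanks, Dunois, for all the effort you put into this. I have a couple of questions, but I would rather start with the "error" message that I received. To fully understand your code, I started with one example, "www.pixilink.com/3". The problem is that when I run open(mycon, "r") for this link, I get the error message: Error in open.connection(mycon, "r"): cannot open the connection. I do not get this error for the other examples. I then skipped this step and ran the code on my URL list, and at the 232nd URL, which is https://www.pixilink.com/141451#mode=tour, I received the same error message: Error in open.connection(mycon, "r"): cannot open the connection. This link does exist, and I do not know why this happens. Any comment on this? I can definitely skip these URLs for now, but I wanted to fully understand the code.

I want to say thank you for your effort. I understand that you would need the full dataset to resolve all the issues. I have shared it here so that you can check it and get a better idea of what is going on in my dataset. I hope you can help me resolve this.

Another example of this kind of error is the link https://www.pixilink.com/140079#mode=tour. When I run it, I get the following error: Error in open.connection(mycon, "r"): cannot open the connection. In addition: Warning message: In open.connection(mycon, "r"): cannot open URL 'https://www.pixilink.com/140079#mode=tour': HTTP status was '400 Bad Request'. This link also exists, and I am not sure why R cannot establish a connection to it.

No, I ran the exact same function on the test URLs you provided.

Yes, it is strange! I asked about this on Stack Overflow, and someone else mentioned that he ran it on Linux and did not get any errors. Are you using Linux too? I am on Windows 10, so maybe that is the reason. To make the code work for me, I had to separate out all the links that caused the connection error (47 links), and then the code ran perfectly for me.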
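
The thread does not establish why these particular URLs fail. One thing that may be worth trying, purely as an untested assumption, is to remove the "#mode=tour" fragment before opening the connection, since the fragment part of a URL is meant for the client and is not normally sent to the server. The sketch below also turns errors into NA so that a loop over thousands of URLs keeps going instead of stopping at the first failure. Both helper names (strip_fragment, try_fetch) are hypothetical.

```r
# Hypothetical helper: drop everything from the first "#" onward
strip_fragment <- function(x) sub("#.*$", "", x)

# Hypothetical wrapper: call any fetch function `f` on the cleaned URL and
# return NA instead of raising an error, so one bad URL does not abort a loop
try_fetch <- function(x, f) {
  tryCatch(f(strip_fragment(x)), error = function(e) NA_character_)
}

# Offline illustration of the string handling only (no request is made)
strip_fragment("https://www.pixilink.com/140079#mode=tour")
```

In practice f could be a function such as urlmode(); the URLs that come back as NA can then be collected and inspected separately rather than removed by hand beforehand.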
library(dplyr) # needed for left_join()

# new_df and readUrl() are defined elsewhere and are not shown here
a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows

tt <- numeric(length(vec)) # checking validity of url
for (i in seq_along(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url

dd <- numeric(nrow(g2))
for (j in seq_len(nrow(g2))){ # seq_len() is safe when g2 has zero rows
  dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2,dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
