How to properly close connections so I won't get "Error in file(con, "r") : all connections are in use" when using "readLines" and "tryCatch"

Tags: r, url, try-catch, readlines

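I have a list of URLs (more than 4000) from a specific domain (pixilink.com), and what I want to do is determine whether each URL points to a picture or a video. For this, I used the solution provided here and wrote the following code: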

#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}

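So, my guess as to why this is happening is that you are not closing the connections you open via tryCatch() and via urlmode() through its use of readLines(). I wasn't sure how urlmode() was going to be used, so I had made it as simple as I could (in hindsight that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try to make it a little more robust for what appears to be a more extensive task at hand.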
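To make the point about unclosed connections concrete, here is a minimal sketch that runs offline. It assumes a temporary local file as a stand-in for a remote page (the file and its contents are made up for illustration). It shows that a connection object keeps occupying a slot in R's connection table until close() is called on it; R only allows a limited number of connections to exist at once, which is what the "all connections are in use" error refers to.

```r
# Hypothetical local stand-in for a remote page, so the sketch runs offline
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "var initial_mode = 'tour';", "</html>"), tmp)

# Number of rows in the connection table (this count includes the
# standard streams stdin, stdout and stderr)
count_connections <- function() nrow(showConnections(all = TRUE))

before <- count_connections()

con <- file(tmp)            # creates a connection object
open(con, "r")              # opens it for reading
page <- readLines(con)
during <- count_connections()   # the new connection is still registered

close(con)                  # destroys the connection and frees its slot
after <- count_connections()

print(c(before = before, during = during, after = after))
```

The same bookkeeping applies to connections created with url(): each one that is created but never closed stays registered, and a long loop over thousands of URLs can exhaust the available slots.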

I think the comments in the code should help; have a look at the rewritten function below:

#Updated URL mode function with better 
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  
  #Check if URL is good to go
  if(!httr::http_error(x)){
    
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"
    
    #Then since there are redirect shenanigans
    #Get the actual URL the input points to
    #It should just be the input URL if there is
    #no redirection
    #This is important as this also takes care of
    #checking whether http or https need to be prefixed
    #in case the input URL is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url
    
    #Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    
    mypos <- grep("initial_mode = ", mycontent)
    
    #Close the connection since it's no longer
    #necessary
    close(mycon)
    
    #Some URLs with weird formats can return 
    #empty on this one since they don't
    #follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually
    #redirected from "https://www.pixilink.com/3"
    #After that, evaluate what's at mypos, and always 
    #return the actual URL
    #along with the result
    if(!purrr::is_empty(mypos)){
      
      #mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)
      
      #So once all that is done, check if the line at mypos
      #contains a 0 (picture), tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
        #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
        #return("video")
      #}
      
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
    
  } else{
    #Straight up invalid URL
    #No myx variable to return here
    #Just x
    return(c(x, "invalid"))
    #return("invalid")
  }
  
}


#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)


#All future + progressr related stuff
#learned courtesy 
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)

#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))


#Website's base URL
baseurl <- "https://www.pixilink.com"

#Using future_lapply() to apply urlmode()
#to each URL in a sequence on pixilink, in parallel,
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar

#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ##Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
  
})




#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")

#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")


head(sitetype)
#                        given_url                     actual_url        mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310     invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311     invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313     picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315        tour

unique(sitetype$mode)
# [1] "invalid"     "floorplan2d" "picture"     "tour" 

#--------
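In urlmode() above, close(mycon) is only reached if open() and readLines() both succeed. A possible variant, shown here as a sketch rather than as part of the original answer, registers the cleanup with on.exit() so the connection is released even when an error occurs. The helper name read_and_close() and the temporary file are assumptions made for illustration; with a real page one would pass url(myx) instead of file(tmp).

```r
# Hypothetical helper: takes an unopened connection, reads all lines from it,
# and always closes it afterwards
read_and_close <- function(con) {
  # Registered before open(), so it runs even if open() or readLines() fails
  on.exit(close(con), add = TRUE)
  open(con, "r")
  readLines(con, warn = FALSE)
}

# Offline usage with a temporary file standing in for a web page
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "var initial_mode = 'tour';", "</html>"), tmp)

n_before <- nrow(showConnections(all = TRUE))

# Successful read
page <- read_and_close(file(tmp))

# Failing read: the file does not exist, so open() raises an error,
# but the connection is still cleaned up by on.exit()
missing_path <- file.path(tempdir(), "does-not-exist.html")
missing_result <- tryCatch(
  suppressWarnings(read_and_close(file(missing_path))),
  error = function(e) NA_character_
)

n_after <- nrow(showConnections(all = TRUE))
```

Comparing n_before and n_after is a quick way to check that neither the successful nor the failing call left a connection behind.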

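Why don't you close the connections after you are done with them?

@IRTFM, I have already tried that with closeAllConnections(), but I still got the same error message.

Thanks Dunois for all the effort you put into this. I have a few questions, but I would rather start with the "error" message I receive. To fully understand your code, I started with one example, "www.pixilink.com/3". The problem is that when I run open(mycon, "r") for this link, I get the error message: Error in open.connection(mycon, "r") : cannot open the connection. For other examples I don't get this error, though. I then skipped this step and ran the code on my list of URLs, and at the 232nd URL, which is https://www.pixilink.com/141451#mode=tour, I received the same error message: Error in open.connection(mycon, "r") : cannot open the connection. This link actually exists and I don't know why this happens. Any comment on that? I can definitely skip these for now, but I just want to fully understand the code.

Again, I want to say thank you for your effort. I understand that you would need the full dataset to work through all the issues. I have shared it here so that you can check it and get a better idea of what is happening in my dataset. Hopefully you can help me resolve this issue.

Another example of this kind of error is the link https://www.pixilink.com/140079#mode=tour. When I run it, I get the following error: Error in open.connection(mycon, "r") : cannot open the connection. In addition: Warning message: In open.connection(mycon, "r") : cannot open URL 'https://www.pixilink.com/140079#mode=tour': HTTP status was '400 Bad Request'. This link also exists, and I'm not sure why R cannot establish a connection to it.

No, I ran the exact same function on the test URLs you provided. Yes, that's weird! I asked about this on StackOverflow and someone mentioned that he ran it on Linux and didn't get any errors! Are you using Linux as well? I'm using Windows 10, maybe that's the reason. To make the code work for me, I had to separate out all the links that caused the connection error (47 links), and then the code ran perfectly for me.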
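Both failing URLs quoted in the comments end in a "#mode=tour" fragment. One possible explanation, which is an assumption and not something confirmed in the thread, is that on some platforms the fragment is sent to the server as part of the request and triggers the "400 Bad Request" response. A minimal sketch of stripping the fragment before opening the connection follows; strip_fragment() is a hypothetical helper name.

```r
# Hypothetical helper: remove the fragment (everything from the first "#")
# from a URL before it is handed to url() or readLines()
strip_fragment <- function(u) {
  sub("#.*$", "", u)
}

cleaned <- strip_fragment("https://www.pixilink.com/140079#mode=tour")
print(cleaned)

# URLs without a fragment are returned unchanged
unchanged <- strip_fragment("https://www.pixilink.com/93313")
print(unchanged)
```

Note that for these URLs the fragment itself already names a mode ("tour"), so it may be worth keeping the original string alongside the cleaned one.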
# Check each URL with readUrl(), then classify the valid ones with urlmode()
# (new_df and readUrl() are not defined in this snippet)
library(dplyr) # for left_join()

a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows

tt <- numeric(length(vec)) # checking validity of url
for (i in seq_along(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url

dd <- character(nrow(g2)) # urlmode() returns a string (or NA)
for (j in seq_len(nrow(g2))){
  dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2,dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
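The loops above stop at the first URL that raises an error. A sketch of an alternative is to wrap each call in tryCatch() so that a failing URL yields NA and the loop carries on. The names classify_all() and fake_mode() are hypothetical; fake_mode() is only a stand-in for a function like urlmode() so that the example runs without network access.

```r
# Hypothetical wrapper: apply a classifier to every URL, returning NA for
# any URL where the classifier raises an error
classify_all <- function(urls, classify) {
  vapply(urls, function(u) {
    tryCatch(
      as.character(classify(u))[1],
      error = function(e) NA_character_
    )
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in classifier (an assumption for illustration, not the real urlmode())
fake_mode <- function(u) {
  if (grepl("bad", u)) stop("cannot open the connection")
  if (grepl("tour", u)) "video" else "picture"
}

modes <- classify_all(c("site/1", "site/bad", "site/3#mode=tour"), fake_mode)
print(modes)
```

Because the result is always a character vector of the same length as the input, it can be bound to the URL column directly, and the NA entries identify the links that need a closer look.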