Python: checking whether a website serves a photo or a video based on a pattern in its URL

python r string url

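I would like to know how to tell whether a website serves a photo or a video by inspecting its URL. I investigated the site I'm interested in and found that most of the links I have look like this (I'm not sure whether I can actually name the site, so for now I'm writing it as an example): http://www.example.com/abcdef

where example is the main domain and abcdef is a number such as 69964. An interesting pattern I found is that, after entering this URL, if the page does have a video, the URL automatically changes to

https://www.example.com/abcdef#mode=tour

and if it is only a photo, it becomes

https://www.example.com/abcdef#mode=0

Now I have a list of URLs from this site, and I just want to check whether each one has a photo or a video, or whether it doesn't work (an invalid URL). Is there any way to do this?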
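The pattern described above can be written down as a small check. This is only a sketch in Python (one of the question's tags): the function name label_from_final_url and the "unknown" label are made up for illustration. It only helps when the final URL, fragment included, is already known (for example, copied from the browser's address bar), because the part after # is never sent to the server in an HTTP request.

```python
from urllib.parse import urlsplit

# Mapping taken from the pattern described in the question.
MODE_LABELS = {"mode=tour": "video", "mode=0": "photo"}


def label_from_final_url(url: str) -> str:
    """Classify a URL by its fragment; 'unknown' if the pattern is absent."""
    fragment = urlsplit(url).fragment  # text after '#', without the '#'
    return MODE_LABELS.get(fragment, "unknown")


print(label_from_final_url("https://www.example.com/abcdef#mode=tour"))  # video
print(label_from_final_url("https://www.example.com/abcdef#mode=0"))     # photo
print(label_from_final_url("https://www.example.com/abcdef"))            # unknown
```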

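So I have a very simple solution.

Inspecting the URLs provided by the OP (e.g., https://www.pixilink.com/93313) shows that the #mode= default is supplied by the variable initial_mode = embedded in the page's JavaScript. So, whether a URL defaults to "picture" (#mode=0) or video (#mode=tour) can be determined by looking at the value assigned to this variable.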

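# Function to get the value of initial_mode from the URL
urlmode <- function(x){
  # Read the page source, one element per line
  mycontent <- readLines(x)
  # Indices of the lines that assign initial_mode
  mypos <- grep("initial_mode = ", mycontent)

  # Use the first matching line; fall back to "" so the checks below
  # reach the "invalid" branch instead of failing on a zero-length condition
  myline <- if(length(mypos) > 0) mycontent[mypos[1]] else ""

  if(grepl("0", myline)){
    cat("\n", x, "has default initial_mode picture: #mode=0 \n")
    return("picture")
  } else if(grepl("tour", myline)){
    cat("\n", x, "has default initial_mode video: #mode=tour \n")
    return("video")
  } else{
    cat("\n", x, "is an invalid URL. \n")
    return("invalid")
  }
}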


#Example URLs to demonstrate functionality
myurl1 <- "https://www.pixilink.com/93313"
myurl2 <- "https://www.pixilink.com/69964"


urlmode(myurl1)
#
# https://www.pixilink.com/93313 has default initial_mode picture: #mode=0 
#[1] "picture"
#Warning message:
#In readLines(x) :
#  incomplete final line found on 'https://www.pixilink.com/93313'
#

urlmode(myurl2)
#
# https://www.pixilink.com/69964 has default initial_mode video: #mode=tour 
#[1] "video"
#Warning message:
#In readLines(x) :
#  incomplete final line found on 'https://www.pixilink.com/69964'
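
Since the question is also tagged python, here is a rough Python counterpart of the same idea, working on page source that has already been downloaded. It is a sketch under an assumption: that the script contains an assignment shaped like initial_mode = "tour" or initial_mode = 0. The exact markup was not verified here, and the names mode_from_html and INITIAL_MODE_RE are made up for illustration.

```python
import re

# Assumption: the page's script contains an assignment such as
#   initial_mode = "tour"   or   initial_mode = 0
# The exact markup was not verified; adjust the pattern if it differs.
INITIAL_MODE_RE = re.compile(r"""initial_mode\s*=\s*["']?(\w+)""")


def mode_from_html(html: str) -> str:
    """Return 'video', 'picture' or 'invalid' from a page's source text."""
    match = INITIAL_MODE_RE.search(html)
    if match is None:
        return "invalid"  # variable not found in the page
    value = match.group(1)
    if value == "tour":
        return "video"
    if value == "0":
        return "picture"
    return "invalid"  # some other, unexpected value


print(mode_from_html('var initial_mode = "tour";'))     # video
print(mode_from_html("var initial_mode = 0;"))          # picture
print(mode_from_html("<html>no such variable</html>"))  # invalid
```

Unlike a plain substring test for "0", capturing the assigned value means a stray 0 elsewhere on the same line does not get mistaken for picture mode.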

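ifelse(stringr::str_detect(url, "tour$"), "is video", "is picture")?

Yes, I think that's right, but I still have a problem. My main URL is http://www.example.com/abcdef. I have to open a browser and enter this URL manually, and then the URL changes to https://www.example.com/abcdef#mode=tour or https://www.example.com/abcdef#mode=0. So we still have a problem here: how to find out whether the URL takes the form https://www.example.com/abcdef#mode=0 or https://www.example.com/abcdef#mode=tour.

Could you provide an example URL that people can actually test with?

@Dunois, you can check http://www.pixilink.com/69964 as an example. Once you enter this in a browser, it changes to https://www.pixilink.com/69964#mode=tour, which shows that it has a video.

Could you also provide an example of #mode=0?

Wow, thank you. For reference, may I ask: you are actually checking initial_mode? As I understand it, the difference comes from this variable; however, I don't see it in the code. Also, is there a way to improve the code so that it first checks whether the URL works?

So, readLines() fetches the content of the URL, grep()ing for initial_mode = finds the number of the line where initial_mode sits, and the grepl()s in the if/else chain check whether that particular line contains 0 (picture) or tour. If you want to check whether the URL is valid, you can wrap httr::GET() or something similar in a tryCatch() as the first line of the function (or wrap the whole function in a tryCatch()). Here: mypos

Is there a limit on the number of URLs we can process in a day? I ask because when I combined this with tryCatch on 100 sample URLs the code ran perfectly, but after I ran it on 4,500 URLs I got the error message "Error in file(con, "r") : all connections are in use", and now I'm starting to wonder whether I can only process a limited number of URLs per day. Is that right, or is there another problem here?

There is no limit, at least not in this context from the user's perspective. You may need to call closeAllConnections() after each function call, since you have probably hit the limit on the number of concurrently open connections. I don't know what your code looks like, but if you set up tryCatch() correctly, I think this is the only problem.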
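
For readers working from the python tag, the two points above (treating unreachable URLs as invalid, and not leaving connections open across thousands of requests) might look roughly like this in Python. It is a sketch, not a tested scraper for the site: fetch_page, classify, classify_urls and fake_fetch are hypothetical names, and classify simply mirrors the substring checks of the R answer. The with block plays the role of closing the connection after every call, and the try/except plays the role of tryCatch().

```python
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_page(url: str) -> str:
    """Download a page; the 'with' block closes the connection every time."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def classify(html: str) -> str:
    """Mirror the R answer: inspect the first line containing 'initial_mode = '."""
    for line in html.splitlines():
        if "initial_mode = " in line:
            if "0" in line:
                return "picture"
            if "tour" in line:
                return "video"
            break
    return "invalid"


def classify_urls(urls: Iterable[str],
                  fetch: Callable[[str], str] = fetch_page) -> Dict[str, str]:
    """Label every URL; a failed download is reported as 'invalid'."""
    results = {}
    for url in urls:
        try:
            results[url] = classify(fetch(url))
        except (OSError, ValueError):
            # urllib's URLError/HTTPError and socket timeouts subclass OSError;
            # ValueError covers malformed URLs such as a missing scheme.
            results[url] = "invalid"
    return results


def fake_fetch(url: str) -> str:
    """Offline stand-in for fetch_page, with made-up page contents."""
    pages = {
        "https://www.example.com/1": 'initial_mode = "tour";',
        "https://www.example.com/2": "initial_mode = 0;",
    }
    if url not in pages:
        raise OSError("unreachable")
    return pages[url]


demo = ["https://www.example.com/1", "https://www.example.com/2",
        "https://www.example.com/3"]
print(classify_urls(demo, fetch=fake_fetch))
```

Passing the fetch function as a parameter keeps the loop testable without network access; calling classify_urls(my_urls) with the default fetch_page would hit the real site.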