Regex: How do I extract the keywords from Google search results page URLs?

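One of the variables in my data set contains the URLs of Google search result pages. I want to extract the search keywords from those URLs.

Example dataset: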

keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
                   url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")), 
              .Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))
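
The output I get so far has three problems:

  • I only need the words as a string. Instead of q=high+five I need high,five.
  • As rows 2, 3 and 5 show, a URL sometimes contains two parts with search keywords. Because the first part is only a reference to the previous search, I only need the second search query.
  • When the URL is not a Google search page URL, it should return NA.

The expected result is: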

    > keyw$words
    [1] "high,five"                           
    [2] "high,five,with,handshake"
    [3] "high,five,with,a,chair"  
    [4] "five,fingers"                        
    [5] "five,short,fingers"
    [6] NA
    

How can I solve this?

There must be a cleaner way, but maybe something like:

    sapply(strsplit(keyw$words, "q="), function(x) {
      x <- if (length(x) == 2) x[2] else x[3]
      gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
    })
    # [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
    # [4] "five,fingers"             "five,short,fingers" 
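
The snippet above relies on a keyw$words column that the sample data does not contain. Below is a minimal sketch of the same split-on-"q=" idea that starts from the URLs themselves. The intermediate regmatches step and the names urls, raw and words are assumptions made for illustration, not part of the original answer, and the sketch does not check that the URL is a Google URL.

```r
# Three URLs taken from the question's sample data
base1 <- "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg"
base2 <- "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw"
urls <- c(base1,
          paste0(base2, "#safe=off&q=five+short+fingers+"),
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# Assumed intermediate step: collect every raw "q=..." part of each URL
raw <- vapply(regmatches(urls, gregexpr("q=[^&#]*", urls)),
              paste, character(1), collapse = ",")

# Split on "q=" and keep only the last piece, i.e. the most recent query
words <- vapply(strsplit(raw, "q=", fixed = TRUE), function(x) {
  if (length(x) == 0 || !nzchar(x[length(x)])) return(NA_character_)
  last <- sub("\\+$", "", x[length(x)])   # drop a trailing "+"
  gsub("+", ",", last, fixed = TRUE)      # "+" separates the words
}, character(1))

print(words)
```

Taking the last piece instead of testing for a length of 2 or 3 keeps the sketch working when a URL carries more than two q= parts.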
    
I would try:

    x<-as.character(keyw$url)
    vapply(regmatches(x,gregexpr("(?<=q=)[^&]+",x,perl=TRUE)),
           function(y) paste(unique(unlist(strsplit(y,"\\+"))),collapse=","),"")
    #[1] "high,five"                "high,five,with,handshake"
    #[3] "high,five,with,a,chair"   "five,fingers"            
    #[5] "five,fingers,short"
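
Note that the version above merges both queries through unique(), which is why row 5 comes out as "five,fingers,short", and that it returns an empty string rather than NA for the YouTube URL. A hedged variant of the same lookbehind idea that keeps only the last match per URL might look like this (urls, matches and words are illustrative names, not from the original answer):

```r
# Three URLs taken from the question's sample data
base1 <- "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg"
base2 <- "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw"
urls <- c(base1,
          paste0(base2, "#safe=off&q=five+short+fingers+"),
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# All values that follow "q=" (PCRE lookbehind, hence perl = TRUE)
matches <- regmatches(urls, gregexpr("(?<=q=)[^&]+", urls, perl = TRUE))

words <- vapply(matches, function(y) {
  if (length(y) == 0) return(NA_character_)      # no q= part at all
  parts <- strsplit(y[length(y)], "+", fixed = TRUE)[[1]]
  paste(parts[nzchar(parts)], collapse = ",")    # skip empty pieces
}, character(1))

print(words)
```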
    
Update (borrowing part of David's regex):

Using:

    pat <- "^https://www\\.google.(?:com?.)?[a-z]{2,3}/.*\\b?q=([^&]*[^&+]).*"

produces:

    [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
    [4] "five,fingers"             "five,short,fingers"       NA   

The earlier version produced:

    [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
    [4] "five,fingers"             "five,short,fingers,"     

Here we use greediness to make sure everything up to the last q= part is skipped, then use the standard sub / \\1 trick to capture what we want. Finally, + is replaced with a comma.
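
To illustrate the greediness argument, here is a small sketch on a single, shortened URL. The simplified patterns and the variable names are assumptions for illustration; they are not the pat shown above.

```r
# Shortened URL with two q= parts (illustrative)
url <- "https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+"

# Greedy ".*" runs to the LAST "q=", so the capture group holds the second query
greedy <- sub("^.*q=([^&]*).*", "\\1", url)

# Lazy ".*?" stops at the FIRST "q=", which is the query we do not want
lazy <- sub("^.*?q=([^&]*).*", "\\1", url, perl = TRUE)

# Drop the trailing "+" and turn the remaining "+" into commas
words <- gsub("+", ",", sub("\\+$", "", greedy), fixed = TRUE)

print(c(greedy = greedy, lazy = lazy, words = words))
```

The only difference between the two sub() calls is the quantifier in front of q=, which is what decides whether the first or the last query ends up in \\1.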

Or this:

    gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", keyw$url))
    # [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
    # [4] "five,fingers"             "five,short,fingers"  
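
A short sketch of what the trailing [^+&] in that pattern does, and of how the pattern behaves when a URL has no q= part. The shortened URL and the variable names are illustrative assumptions.

```r
url <- "https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+"

# The final [^+&] forces the capture to end on a character that is not "+",
# so the trailing "+" is left outside the group
with_guard <- sub(".*q=([^&#]*[^+&]).*", "\\1", url)

# Without it, the trailing "+" is captured too
without_guard <- sub(".*q=([^&#]*).*", "\\1", url)

# When nothing matches, sub() returns its input unchanged rather than NA
other <- "https://www.youtube.com/watch?v=6HOallAdtDI"
unchanged <- sub(".*q=([^&#]*[^+&]).*", "\\1", other)

print(c(with_guard, without_guard, unchanged))
```

Because a non-matching URL comes back as-is, this pattern alone would need a separate check to produce NA for non-Google URLs.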
    

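Another update after the comments (it looks too complex, but it is the best I can manage at the moment :)):

    library(stringr)  # provides str_extract() and str_extract_all()
    keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
    > keyw$words
    [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"   "five,fingers"            
    [5] "five,short,fingers"       NA             

(Look-around (?...

Hmm... This is similar to what I suggested in the chat room, but it does not handle the end.

@DavidArenburg I haven't been in the chat room for the last 10 minutes. And yes, good point about the trailing comma.

Thanks, the trailing comma needs to be removed. See also my updated question about non-Google URLs.

@Jaap, no problem, though that is almost certainly because you have different Google sites. This only matches one specific form.

Very close. See also my updated question after testing on real data.

@Jaap I updated the answer, but I still wonder what you expect for http://whatever.com/search?q=my+search+words: should it be NA or my,search,words?

Interesting point. What if I want to give http://whatever.com/search?q=my+search+words an NA value?

@Jaap I can't see any way to do that in pure regex. I will try to update the answer for this evolution later today, using a two-pass filter.

Thanks a lot! For this specific problem your answer gives me the result I want, because my data set contains only Google search pages. For the future I will certainly need a solution that can filter out specific searches (for example all searches like http://google.com/search?q=my+search+words, or all searches on a specific page like http://whatever.com/search?q=my+search+words).

Thanks. This works, but the other solutions are indeed cleaner. See also my updated question about Google URLs.

@Jaap, if I'm not mistaken, my code also gives you the NA you expect, right?

That's right. I also included an integrated solution in your answer (it could probably be improved). Hope you don't mind.

The earlier version of the pat-based answer (no Google host check; a trailing + ends up as a trailing comma) was:

    gsub("\\+", ",", sub("^.*\\bq=([^&]*).*", "\\1", keyw$url))

A variant that applies the same look-around pattern directly to keyw$url, without the Google host filter:
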
    > keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
    > keyw$words
    [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
    [4] "five,fingers"             "five,short,fingers"       NA
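
The comments mention a two-pass filter for the case where only searches on a particular site should count. Below is a minimal base-R sketch of that idea, under stated assumptions: the function name extract_keywords, its host_pattern argument and the default host regex are hypothetical and not taken from any of the answers.

```r
# Hypothetical helper: pass 1 decides which URLs qualify, pass 2 extracts the last q= value
extract_keywords <- function(urls,
                             host_pattern = "^https?://(www\\.)?google\\.[a-z.]+/") {
  urls <- as.character(urls)  # the sample data stores the URLs as a factor
  # Pass 1: URL is on the wanted host and has a q= parameter
  is_target <- grepl(host_pattern, urls) & grepl("[?&#]q=", urls)
  # Pass 2: greedy match up to the last q=, capture up to the next & or #
  last_q <- sub("^.*[?&#]q=([^&#]*).*$", "\\1", urls)
  last_q <- gsub("^\\++|\\++$", "", last_q)  # trim leading/trailing "+"
  words <- gsub("\\++", ",", last_q)         # remaining "+" runs become commas
  words[!is_target | !nzchar(words)] <- NA_character_
  words
}

# The six URLs of the sample data, in row order
hf <- "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg"
ff <- "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw"
urls <- c(hf,
          paste0(hf, "#safe=off&q=high+five+with+handshake"),
          paste0(hf, "#safe=off&q=high+five+with+a+chair"),
          ff,
          paste0(ff, "#safe=off&q=five+short+fingers+"),
          "https://www.youtube.com/watch?v=6HOallAdtDI")

result <- extract_keywords(urls)
print(result)

# The same helper restricted to another (hypothetical) site
other <- "http://whatever.com/search?q=my+search+words"
print(extract_keywords(other))                                      # NA with the default host
print(extract_keywords(other, host_pattern = "^https?://whatever\\.com/"))
```

Keeping the host check separate from the extraction pattern means the choice of which sites count can change without touching the regex that pulls out the keywords.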