Parsing URL strings in R

Suppose I have a series of URL strings that I have imported into R:

url = c("http://www.mdd.com/food/pizza/index.html", "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html", "http://www.mdd.com/find-a-location.html")

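I want to parse these URLs to identify which page each one refers to. For instance, I would like to be able to map url[3] to the special deals page. For this example, suppose I have the following "types" of pages: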

xtype = c("deals","find")
dtype = c("ingrediants","calories","chef")

Given these types, I want to take the url variable and map each URL to its type.

So I should end up with:

> df
                                           url  site
1     http://www.mdd.com/food/pizza/index.html dtype
2 http://www.mdd.com/build-your-own/index.html dtype
3        http://www.mdd.com/special-deals.html xtype
4      http://www.mdd.com/find-a-location.html xtype

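I started this project thinking I would need to break each URL apart with strsplit. However, the following does not split the URL. Splitting the URLs would let me put together some if-else statements to perform this task. Efficient? No, but as long as it gets the job done.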

Words = strsplit(as.character(url), " ")[[1]]
Words

Here are my main questions:

1. Is there a package to do URL parsing in R?
2. How can I go about identifying the page which is viewed from a large url string?

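Edit:

What I'm asking is: how can I work out the "specific page" from the URL string? So if I have "", I would like to know how to extract just build-your-own.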

It's a little unclear where you are headed, but here are a few ways to parse URLs.

Using the basename function:

sapply(url, basename)
  http://www.mdd.com/food/pizza/index.html http://www.mdd.com/build-your-own/index.html 
                              "index.html"                                 "index.html" 
     http://www.mdd.com/special-deals.html      http://www.mdd.com/find-a-location.html 
                      "special-deals.html"                       "find-a-location.html"

Using a prefix and strsplit:

prefix <- "http://www.mdd.com/"
unlist(strsplit(url, prefix))
[1] ""                          "food/pizza/index.html"     ""                         
[4] "build-your-own/index.html" ""                          "special-deals.html"       
[7] ""                          "find-a-location.html"
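Note that strsplit treats prefix as a regular expression and leaves an empty string in front of every path. As a minimal sketch, assuming every URL starts with the same literal prefix, sub with fixed = TRUE removes the prefix directly, and the remaining path can then be split on "/". The names paths, segments and first_segment are illustrative and are not part of the original answer.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

prefix <- "http://www.mdd.com/"

# fixed = TRUE matches the prefix literally, so "." is not a regex wildcard
paths <- sub(prefix, "", url, fixed = TRUE)

# Split each remaining path on "/" and keep the first piece
segments <- strsplit(paths, "/", fixed = TRUE)
first_segment <- vapply(segments, `[`, character(1), 1)

paths
first_segment
```

Here first_segment still carries the .html extension for the two top-level pages, which may or may not be what is wanted.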

To find which type of URL you are dealing with, you can use grep:

xtype <- c("deals", "find")

sapply(xtype, function(x) grep(x, url))

 deals  find 
     3     4
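To get from these matches to the data frame shown in the question, one possible sketch is to test each URL against all xtype keywords at once with grepl. It assumes, as the question's expected output suggests, that any URL not matching an xtype keyword should be labelled dtype; the variable is_xtype is a hypothetical helper.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

xtype <- c("deals", "find")

# Build the alternation "deals|find" and test every URL against it
is_xtype <- grepl(paste(xtype, collapse = "|"), url)

# Assumption: everything that is not an xtype page counts as dtype
df <- data.frame(url  = url,
                 site = ifelse(is_xtype, "xtype", "dtype"),
                 stringsAsFactors = FALSE)
df
```

Matching against the whole URL is fragile if a keyword can also appear in the domain, so on real data it may be safer to match against the path only.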

You can use the parse_url function from the httr package to parse the URLs. A regular expression can then be used to extract the relevant substring:

library(httr)

# Take the path of each URL and keep everything before the first "/" or "."
sub("(.+?)[./].+", "\\1", sapply(url, function(x) parse_url(x)$path, 
                                 USE.NAMES = FALSE))

# [1] "food"            "build-your-own"  "special-deals"   "find-a-location"
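If adding a package dependency is not an option, roughly the same result can be sketched in base R with a single regular expression. This is an illustration rather than part of the original answer, and it assumes every URL has the form scheme://host/path with a non-empty path; the name first_part is hypothetical.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# Skip "scheme://host/", then capture the path up to the first "/" or "."
first_part <- sub("^[a-z]+://[^/]+/([^/.]+).*$", "\\1", url)
first_part
```

A URL that does not match the pattern is returned unchanged by sub, so unexpected inputs are worth checking separately.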

There is now also the urltools package, which is much faster than most other methods:

url <- c("http://www.mdd.com/food/pizza/index.html", 
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html", 
         "http://www.mdd.com/find-a-location.html")

urltools::url_parse(url)

##   scheme      domain port                      path parameter fragment
## 1   http www.mdd.com          food/pizza/index.html                   
## 2   http www.mdd.com      build-your-own/index.html                   
## 3   http www.mdd.com             special-deals.html                   
## 4   http www.mdd.com           find-a-location.html

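It's not at all clear to me what you are trying to do. I think sapply(strsplit(as.character(url), "\\"), "[[", 1) may be what you are really after.

In addition to Tyler Rinker's comment, it may be easier to organize/index the pages using basename(url).

For http://www.mdd.com/special-deals.html, do you want special-deals or special-deals.html?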
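The basename idea can be taken a step further. As a rough sketch, file_path_sans_ext from the tools package (shipped with R) drops the .html extension, and for index pages the name of the parent directory can be used instead. Treating "index" as a placeholder page name is an assumption made for this example, and the variable names page and label are illustrative.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# basename() is vectorised, so no sapply() is needed
page <- tools::file_path_sans_ext(basename(url))

# Assumption: "index" says nothing about the page, so fall back to
# the name of the directory that contains it
label <- ifelse(page == "index", basename(dirname(url)), page)

page
label
```

With this approach the first URL is labelled by its innermost directory (pizza) rather than its first path segment (food), so the choice depends on how the site is organized.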