Parsing URL strings in R

Suppose I have a series of URL strings that I have imported into R:

url = c("http://www.mdd.com/food/pizza/index.html", "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html", "http://www.mdd.com/find-a-location.html")
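I want to parse these URLs to identify which page each one is. For instance, I want to be able to map url[3] to the special deals page. For this example, assume I have the following "types" of pages: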

xtype = c("deals","find")
dtype = c("ingrediants","calories","chef")
Given these types, I want to take the url variable and map each URL to its type.

So I should end up with:

> df
                                           url  site
1     http://www.mdd.com/food/pizza/index.html dtype
2 http://www.mdd.com/build-your-own/index.html dtype
3        http://www.mdd.com/special-deals.html xtype
4      http://www.mdd.com/find-a-location.html xtype
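When I started on this project, I thought I would need to use strsplit to take each URL apart. However, the following does not split the URL. Splitting the URL would let me put together a few if-else statements to perform this task. Efficient? No, but it only needs to get the job done.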

Words = strsplit(as.character(url), " ")[[1]]
Words
Here are my main questions:

1. Is there a package to do URL parsing in R?
2. How can I go about identifying the page which is viewed from a large url string?
Edit:

What I am asking is how to find the "specific page" from a URL string. So, given a URL, I want to know how to extract just build-your-own.

It is not clear where you are headed with this, but here are some ways to parse URLs.

Use the basename function:

sapply(url, basename)
  http://www.mdd.com/food/pizza/index.html http://www.mdd.com/build-your-own/index.html 
                              "index.html"                                 "index.html" 
     http://www.mdd.com/special-deals.html      http://www.mdd.com/find-a-location.html 
                      "special-deals.html"                       "find-a-location.html" 
Use a prefix and strsplit:

prefix <- "http://www.mdd.com/"
unlist(strsplit(url, prefix))
[1] ""                          "food/pizza/index.html"     ""                         
[4] "build-your-own/index.html" ""                          "special-deals.html"       
[7] ""                          "find-a-location.html"  
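
The empty strings in that output appear because strsplit treats the prefix as a separator. As a minimal sketch, assuming every URL begins with the same literal prefix, sub() with fixed = TRUE removes the prefix directly, and a second split on "/" yields the first path segment. The names paths and first_segment are illustrative only.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")
prefix <- "http://www.mdd.com/"

# Remove the literal prefix; fixed = TRUE stops "." from acting as a regex wildcard.
# Assumption: the prefix occurs once, at the start of every URL.
paths <- sub(prefix, "", url, fixed = TRUE)

# Split each remaining path on "/" and keep only its first component.
first_segment <- vapply(strsplit(paths, "/", fixed = TRUE), `[`, character(1), 1)

print(paths)
print(first_segment)
```

For URLs without a subdirectory, the first segment still carries the .html extension, so it may need a further cleanup step.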
To find out which type of URL you are dealing with, you can use grep:

xtype <- c("deals", "find")

> sapply(xtype, function(x) grep(x, url))

 deals  find 
     3     4 
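
Building on the grep idea, the data frame requested in the question could be sketched as follows. This assumes that any URL matching none of the xtype keywords should be labelled dtype, which is what the desired output in the question implies but does not state. The variable is_x is a hypothetical helper name.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")
xtype <- c("deals", "find")

# Combine the keywords into one alternation pattern: "deals|find"
pattern <- paste(xtype, collapse = "|")

# TRUE where a URL contains at least one xtype keyword
is_x <- grepl(pattern, url)

# Assumption: everything that is not an xtype page counts as dtype
df <- data.frame(url  = url,
                 site = ifelse(is_x, "xtype", "dtype"),
                 stringsAsFactors = FALSE)
print(df)
```

Because grepl is vectorised, this avoids the chain of if-else statements mentioned in the question. Keywords containing regex metacharacters would need escaping first.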

You can use the parse_url function from the httr package to parse a URL. A regular expression can then be used to extract the relevant substring:

library(httr)  # provides parse_url()

sub("(.+?)[./].+", "\\1", sapply(url, function(x) parse_url(x)$path, 
                                 USE.NAMES = FALSE))

# [1] "food"            "build-your-own"  "special-deals"   "find-a-location"
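
If installing a package is not an option, a similar page label can be sketched with base R alone. This is only one possible convention, not the method from the answer above: it uses the file name without its extension, and falls back to the parent directory when the file is an index page. The function name page_name is hypothetical.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# Hypothetical helper: label each URL by its last meaningful path component.
page_name <- function(u) {
  leaf <- basename(u)                      # e.g. "index.html", "special-deals.html"
  is_index <- grepl("^index\\.", leaf)     # index pages say nothing about the page
  leaf[is_index] <- basename(dirname(u[is_index]))  # use the parent directory instead
  tools::file_path_sans_ext(leaf)          # drop ".html" where it is still present
}

pages <- page_name(url)
print(pages)
```

Note that this labels the first URL by its deepest directory (pizza) rather than its first one (food), so it differs from the regular-expression result for nested paths.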

There is now also the urltools package, which is much faster than most other approaches:

url <- c("http://www.mdd.com/food/pizza/index.html", 
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html", 
         "http://www.mdd.com/find-a-location.html")

urltools::url_parse(url)

##   scheme      domain port                      path parameter fragment
## 1   http www.mdd.com          food/pizza/index.html                   
## 2   http www.mdd.com      build-your-own/index.html                   
## 3   http www.mdd.com             special-deals.html                   
## 4   http www.mdd.com           find-a-location.html                   

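It is not at all clear to me what you are trying to do. I think sapply(strsplit(as.character(url), "\\"), "[[", 1) may be what you really want.

In addition to Tyler Rinker's comment, it may be easier to organize/index the pages using basename(url).

For http://www.mdd.com/special-deals.html, do you want special-deals or special-deals.html?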