Parsing URL strings in R
Suppose I have a series of URL strings that I have imported into R:
url = c("http://www.mdd.com/food/pizza/index.html", "http://www.mdd.com/build-your-own/index.html",
"http://www.mdd.com/special-deals.html", "http://www.mdd.com/find-a-location.html")
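I want to parse these URLs to identify what kind of page each one is. For example, I would like to be able to map url[3] to the special-deals page. For this example, suppose I have the following "types" of pages: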
xtype = c("deals","find")
dtype = c("ingrediants","calories","chef")
Given these types, I want to take the url variable and map each URL to its type.
So I should end up with:
> df
                                           url  site
1     http://www.mdd.com/food/pizza/index.html dtype
2 http://www.mdd.com/build-your-own/index.html dtype
3        http://www.mdd.com/special-deals.html xtype
4      http://www.mdd.com/find-a-location.html xtype
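I started this project thinking I would need to use strsplit to break each URL apart. However, the following does not split the URL. Splitting the URL would let me put together some if-else statements to do the job. Efficient? No, but as long as it gets the job done.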
Words = strsplit(as.character(url), " ")[[1]]
Words
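The attempt above returns the URL unchanged because the strings contain no spaces, so there is nothing for strsplit to split on. A minimal sketch of splitting on "/" instead, assuming every URL has the form scheme://host/path (the name path_parts is only illustrative):

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# fixed = TRUE treats "/" as a literal separator rather than a regex.
parts <- strsplit(url, "/", fixed = TRUE)

# Each result starts with "http:", "" and the host name;
# dropping those three elements leaves only the path segments.
path_parts <- lapply(parts, function(p) p[-(1:3)])

path_parts[[1]]
path_parts[[3]]
```

Note that `[[1]]` in the original attempt keeps only the first URL's pieces; working with the full list keeps one entry per URL.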
Here are my main questions:
1. Is there a package to do URL parsing in R?
2. How can I go about identifying the page which is viewed from a large url string?
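Edit:
What I am asking is: how do I pick out the "specific page" from a URL string? So if I have "", I want to know how to extract just build-your-own.

It is not yet clear which direction you are heading, but here are a few ways to parse URLs.
Use the basename function: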
sapply(url, basename)
http://www.mdd.com/food/pizza/index.html http://www.mdd.com/build-your-own/index.html
"index.html" "index.html"
http://www.mdd.com/special-deals.html http://www.mdd.com/find-a-location.html
"special-deals.html" "find-a-location.html"
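As the output above shows, basename alone makes both index.html pages look identical. One possible refinement, offered only as a sketch: strip the file extension and, for index pages, fall back to the name of the parent directory. The variable names page and is_index are illustrative, and the fallback rule is an assumption about what counts as the "page".

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# File name without its extension, e.g. "special-deals.html" -> "special-deals".
page <- tools::file_path_sans_ext(basename(url))

# For index pages, use the last directory in the path instead.
is_index <- page == "index"
page[is_index] <- basename(dirname(url[is_index]))

page
```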
Use a prefix and strsplit:
prefix <- "http://www.mdd.com/"
unlist(strsplit(url, prefix))
[1] "" "food/pizza/index.html" ""
[4] "build-your-own/index.html" "" "special-deals.html"
[7] "" "find-a-location.html"
To find out which type of URL you are dealing with, you can use grep:
xtype <- c("deals", "find")
> sapply(xtype, function(x) grep(x, url))
deals find
3 4
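Building on the grep call above, here is a rough sketch of producing the data frame the question asks for. It assumes that any URL not matching an xtype keyword should be labelled dtype, which is what the expected output implies but is not stated explicitly; is_xtype is an illustrative name.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")
xtype <- c("deals", "find")

# Collapse the keywords into a single alternation pattern: "deals|find".
# grepl returns one TRUE/FALSE per URL, unlike grep, which returns indices.
is_xtype <- grepl(paste(xtype, collapse = "|"), url)

# Assumption: everything that is not an xtype page counts as dtype.
df <- data.frame(url = url,
                 site = ifelse(is_xtype, "xtype", "dtype"),
                 stringsAsFactors = FALSE)
df
```

Using grepl keeps the result aligned with url even when a keyword matches several URLs or none at all, which the sapply/grep form does not guarantee.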
You can use the parse_url function from the httr package to parse the URL. A regular expression can then extract the relevant substring:
library(httr)
# Take each URL's path, then keep the text before the first "." or "/".
sub("(.+?)[./].+", "\\1", sapply(url, function(x) parse_url(x)$path,
                                 USE.NAMES = FALSE))
# [1] "food" "build-your-own" "special-deals" "find-a-location"
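If installing a package is not an option, roughly the same extraction can be sketched in base R with two sub calls. This assumes every URL starts with a scheme and host followed by "/"; the names path and first_segment are illustrative.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.mdd.com/find-a-location.html")

# Strip the scheme and host, leaving only the path.
path <- sub("^[a-z]+://[^/]+/", "", url)

# Keep everything before the first "." or "/" in the path.
first_segment <- sub("[./].*$", "", path)

first_segment
```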
There is now also the urltools package, which is much faster than most other approaches:
url <- c("http://www.mdd.com/food/pizza/index.html",
"http://www.mdd.com/build-your-own/index.html",
"http://www.mdd.com/special-deals.html",
"http://www.mdd.com/find-a-location.html")
urltools::url_parse(url)
## scheme domain port path parameter fragment
## 1 http www.mdd.com food/pizza/index.html
## 2 http www.mdd.com build-your-own/index.html
## 3 http www.mdd.com special-deals.html
## 4 http www.mdd.com find-a-location.html
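Comments:
It is not at all clear to me what you are trying to do. I think sapply(strsplit(as.character(url), "\\"), "[[", 1) may be what you actually want.
In addition to Tyler Rinker's comment, it may be easier to organize/index the pages using basename(url).
For http://www.mdd.com/special-deals.html, do you want special-deals or special-deals.html?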