R: I'm trying to find out whether scraping Google News is allowed

r web-scraping

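I am using the paths_allowed function from the robotstxt package. In my case, I want to find out whether I am allowed to scrape data from a particular website, but every time I run it I get an error.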

library(robotstxt)
paths_allowed(paths = "https://news.google.com/?hl=en-IN&gl=IN&ceid=IN%3Aen")

The error message looks like this:

news.google.com                      Error in if (is_http) { : argument is of length zero

Thank you.

Just use the httr package and send a GET request to https://news.google.com/robots.txt to get the information we need:

a <- httr::GET("https://news.google.com/robots.txt")
httr::content(a)
User-agent: *
Disallow: /
Disallow: /search?
Allow: /$
Allow: /?
Allow: /nwshp$
Allow: /news$
Allow: /news/$
Allow: /news/?gl=
Allow: /news/?hl=
Allow: /news/?ned=
Allow: /about$
Allow: /about?
Allow: /about/
Allow: /topics/
Allow: /publications/
Allow: /stories/
Allow: /swg/

User-agent: Googlebot
Disallow: /topics/
Disallow: /publications/
Disallow: /stories/
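
To read these rules for the URL in the question, here is a minimal base R sketch. It assumes the longest-match convention (the most specific matching rule wins, and Allow wins a tie), which is how Google documents robots.txt handling. The helper names pattern_to_regex and is_allowed are made up for illustration, and only the "User-agent: *" group from the output above is used.

```r
# Rules of the "User-agent: *" group, copied from the robots.txt output above.
robots_lines <- c(
  "Disallow: /", "Disallow: /search?",
  "Allow: /$", "Allow: /?", "Allow: /nwshp$", "Allow: /news$", "Allow: /news/$",
  "Allow: /news/?gl=", "Allow: /news/?hl=", "Allow: /news/?ned=",
  "Allow: /about$", "Allow: /about?", "Allow: /about/",
  "Allow: /topics/", "Allow: /publications/", "Allow: /stories/", "Allow: /swg/"
)

# Split each line into its rule type (Allow/Disallow) and its path pattern.
rules <- data.frame(
  type    = sub(":.*$", "", robots_lines),
  pattern = trimws(sub("^[^:]*:", "", robots_lines)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn a robots.txt pattern into a regular expression.
# "*" is a wildcard and a trailing "$" anchors the end of the path;
# every other character (including "?") is matched literally.
pattern_to_regex <- function(pattern) {
  anchored <- grepl("\\$$", pattern)
  body <- sub("\\$$", "", pattern)
  body <- gsub("([.\\\\+?^${}()|\\[\\]])", "\\\\\\1", body, perl = TRUE)
  body <- gsub("*", ".*", body, fixed = TRUE)
  paste0("^", body, if (anchored) "$" else "")
}

# Hypothetical helper: apply the longest-match rule to a path (path plus query).
is_allowed <- function(path, rules) {
  hits <- vapply(
    rules$pattern,
    function(p) grepl(pattern_to_regex(p), path, perl = TRUE),
    logical(1)
  )
  if (!any(hits)) return(TRUE)  # no rule applies, so nothing forbids the path
  matched <- rules[hits, ]
  longest <- max(nchar(matched$pattern))
  best <- matched[nchar(matched$pattern) == longest, ]
  any(best$type == "Allow")     # Allow wins when rules of equal length conflict
}

is_allowed("/?hl=en-IN&gl=IN&ceid=IN%3Aen", rules)  # TRUE:  "Allow: /?" beats "Disallow: /"
is_allowed("/search?q=r", rules)                    # FALSE: "Disallow: /search?" is the longest match
```

Under that assumption, the path from the question is allowed for generic crawlers, because "Allow: /?" is more specific than "Disallow: /".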

Thanks for your answer, but is there a logic error in my code? I saw the same code in a YouTube video, and it ran there without any problems.

Hmm... try path=c(url), i.e. put your URL inside c(). I haven't used that package, so I'm not sure.
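
Another thing that may be worth trying is to pass the domain and the path to paths_allowed separately instead of as one full URL. This is only a sketch: split_url and check_url_allowed are hypothetical helpers, and it has not been confirmed that this avoids the "argument is of length zero" error from the question.

```r
# Hypothetical helper: split a URL into host and path-plus-query with base R only.
split_url <- function(url) {
  no_scheme <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop "https://"
  host <- sub("[/?#].*$", "", no_scheme)                     # everything before the path
  path <- substring(no_scheme, nchar(host) + 1)
  path <- sub("#.*$", "", path)                              # drop any fragment
  if (!startsWith(path, "/")) path <- paste0("/", path)
  list(domain = host, path = path)
}

# Hypothetical wrapper: needs the robotstxt package and network access.
check_url_allowed <- function(url, bot = "*") {
  parts <- split_url(url)
  robotstxt::paths_allowed(paths = parts$path, domain = parts$domain, bot = bot)
}

parts <- split_url("https://news.google.com/?hl=en-IN&gl=IN&ceid=IN%3Aen")
parts$domain  # "news.google.com"
parts$path    # "/?hl=en-IN&gl=IN&ceid=IN%3Aen"

# Usage (not run here because it downloads robots.txt):
# check_url_allowed("https://news.google.com/?hl=en-IN&gl=IN&ceid=IN%3Aen")
```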