Difference between the source code downloaded by R and the website's source code
I am scraping a website to extract information about some products, but I am having trouble with the price. My code is as follows:
> library(xml2)
> enlace<-"http://www.carulla.com/products/0000687608965009/Crema+Dental+Sensitive+Proalivio+Colgate"
> download.file(enlace, destfile = "scrapedpage.html", quiet=TRUE)
> doc<-read_html("scrapedpage.html")
> # description
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/h3')))
[1] "<h3 class=\"pdpInfoProductName\" itemprop=\"name\">Crema Dental Sensitive Proalivio Colgate</h3>"
> # reference
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/p')))
[1] "<p class=\"pdpInfoProductRef\">\r\n\t\t\t\t\t\t\t\t\tPresentación:C \r\n\t\t\t\t\t\t\t\t\tPLU:739983</p>"
> # prices
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/div[1]/div[2]/h4')))
[1] ""
I checked this information in the source code of the original page, where I found:
<div class="pdpInfoProduct pull-left">
<h3 class="pdpInfoProductName" itemprop="name">Crema Dental Sensitive Proalivio Colgate</h3>
<h2 class="pdpInfoProductBrand" itemprop="brand">COLGATE</h2>
<p class="pdpInfoProductRef">
Presentación:C
PLU:739983</p>
<div class="pdpInfoProductPrices">
<div class="pull-right">
<div class="pro-big-Ovalo">
<p>25%</p>
</div>
</div>
<div class="pdpInfoProductPrice" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="COP" />
<meta itemprop="price" content="17213.0" />
<h4 class="priceOffer">
$17.213</h4>
<h6 class="before">Antes: <span class="strikeText">
$22.950</span>
</h6>
</div>
</div>
The value I am interested in is $17.213, but when I download the source code with R, I get the following:
> con2<-url(enlace,"r")
> x<-readLines(con2)
> close(con2)
> x[1270:1285]
[1] "\t\t\t\t\t\t\t\t\tPLU:739983</p>"
[2] "\t\t\t\t\t\t\t<div class=\"pdpInfoProductPrices\">\t"
[3] "\t\t\t\t\t<div class=\"pdpInfoProductPrice\" itemprop=\"offers\" itemscope itemtype=\"http://schema.org/Offer\">"
[4] "\t\t\t\t\t"
[5] "\t\t\t\t\t<meta itemprop=\"priceCurrency\" content=\"COP\" />"
[6] " <meta itemprop=\"price\" content=\"\" />"
[7] "\t\t\t\t\t\t<h4 class=\"price\">"
[8] "\t\t\t\t\t\t\t</h4>"
[9] "\t\t\t\t\t\t</div>"
[10] "\t\t\t\t</div>"
[11] "\t\t\t\t"
[12] "\t\t\t\t\t\t\t\t\t"
[13] "\t\t\t\t\t\t\t\t\t\t\t\t\t <div class=\"product-seller row-fluid\">"
[14] "\t\t\t\t <!-- +++++ Carulla Seller +++++ --> "
[15] " <p> Vendido por:   Carulla</p> "
[16] " </div>"
That is, I get \t\t\t\t\t\t\t instead of $17.213.
I would be very grateful for any help.

The website is probably checking the User-Agent (UA) and cookies in an attempt to stop exactly what you are doing. I just tried downloading the page with wget and got a 403 Forbidden error.

Web scraping is a somewhat dated idea these days, at least for commercial pages. There are workarounds (for example, see the help for download.file() and the man pages for wget and curl to learn how to change the UA and import cookies), but if you really want to do this at scale, you may want to look into browser scripting and then import the data into R.

Keep in mind that you are doing something the website owner does not want you to do. In short, this has almost nothing to do with R.
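As a rough illustration of the User-Agent workaround mentioned above, here is a minimal sketch. The helper names fetch_page and extract_price and the User-Agent string are hypothetical, and nothing here guarantees that the site will return the price: it may also depend on cookies or a session, which this sketch does not handle. The headers argument of download.file() is assumed to be available (R 3.6.0 or later). The price is read from the <meta itemprop="price"> tag shown in the question rather than from a positional XPath.

```r
library(xml2)

# Hypothetical helper: download a page while sending a browser-like
# User-Agent. Assumes R >= 3.6.0, where download.file() accepts `headers`.
fetch_page <- function(url,
                       destfile = tempfile(fileext = ".html"),
                       user_agent = "Mozilla/5.0 (X11; Linux x86_64)") {
  download.file(url, destfile = destfile, quiet = TRUE,
                headers = c("User-Agent" = user_agent))
  read_html(destfile)
}

# Hypothetical helper: read the price from the schema.org meta tag.
# Returns NA when the tag is missing or its content attribute is empty.
extract_price <- function(doc) {
  node <- xml_find_first(doc, '//meta[@itemprop="price"]')
  value <- xml_attr(node, "content")
  if (is.na(value) || !nzchar(value)) NA_real_ else as.numeric(value)
}

# Offline check against the two variants of the markup shown in the question
with_price <- read_html(
  '<div><meta itemprop="price" content="17213.0" /><h4 class="priceOffer">$17.213</h4></div>'
)
without_price <- read_html(
  '<div><meta itemprop="price" content="" /><h4 class="price"></h4></div>'
)

extract_price(with_price)     # 17213
extract_price(without_price)  # NA
```

With a live connection, the call would be extract_price(fetch_page(enlace)). An NA result would suggest that the server is still withholding the price from this client.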