Java: Scraping with Jsoup
I am scraping an e-commerce website with Jsoup. I want to get certain tags from it, along with the price. After Jsoup.parse(), I cannot get this information.
Tags: java, html, web-scraping, jsoup
<div id="ctl00_ContentPlaceHolder1_ctl00_ctl03_Showcase">
<div class="controlcontent_r">
<div class="bucketgroup">
<div class="prod_viewsparent">
<div class="bucket" style="width: 175px; height: 280px;">
<div class="bucket_left">
<a href="/Products/Buy-Online-Electronics-Cameras-Digital-Cameras/Nikon/Nikon-Coolpix-L27-Point--Shoot/pid-2849731.aspx">
<img class="mtb-img" style="width: 150px; height: 150px;" src="http://resources-images.martjackhosting.com/s3/martjack-resources/5d4b3aa1-119a-4d82-b9bb-1b6bdbd62002/Images/ProductImages/Source/NikonL27-BLK.jpg;width=150;height=150;scale=canvas" alt="Nikon Coolpix L27 Point & Shoot" title="Digital Cameras, Nikon, Nikon Coolpix L27 Point & Shoot"></a>
<div id="2849731" class="btn_quick_view" style="display:none">
<a rel="2849731,0,2466375,5d4b3aa1-119a-4d82-b9bb-1b6bdbd62002" href="#">Quick View</a></div>
<h4 class="mtb-title">Nikon Coolpix L27 Point & Shoot</h4>
<div class="mtb-desc">
<span class="mtb-price">
<label class="mtb-mrp">
<b class="lb1"> MRP </b>
<span class="WebRupee">Rs. </span>
4,990
</label>
<label class="mtb-ofr">
<b class="lb2"> Now At </b>
<span class="WebRupee">Rs. </span>
4,700
</label>
</span>
<span class="offer_block">
<a class="mtb-more" href="/Products/Buy-Online-Electronics-Cameras-Digital-Cameras/Nikon/Nikon-Coolpix-L27-Point--Shoot/pid-2849731.aspx" title="Click for more details"></div>
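After parsing, I cannot see the <div class="bucket"> tag.
How do I handle this?

Could you show us your code, please? By the way, if you want to parse a website, it is better to use connect() rather than parse().

Here is an example of how to get the … tags: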
final String url = "http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("div.controlcontent_r") )
{
System.out.println(element);
System.out.println();
}
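This code prints three elements (separated by blank lines); the output is listed further below.
There is no bucket tag in it, most likely because the page adds those elements with JavaScript, which jsoup does not execute. That means you cannot retrieve it with jsoup alone; the solution is to use another library that can execute the scripts.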
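Conveniently, I have already posted a short list of such libraries.

Use the Selenium framework (google it) to interact with the JavaScript. You can then parse the resulting element as a Jsoup element. Selenium is easy to pick up; I learned it on the fly.

The page fragment you provided is incomplete, most of the divs are not closed. Can you give us an example URL of the site you are scraping? Jsoup is very good at handling broken HTML, but it cannot handle the half page you have given here! :-)

Thanks, this is the URL I am scraping.

Thanks, and sorry, Oreo, I made a typo and could not find it. Could you send me a code snippet for the tag below?

Thanks, no problem, but unfortunately this tag has turned out to be more complicated. I will edit this into my answer, one moment...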
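The Selenium suggestion above could be sketched roughly as follows: let a real browser run the page's scripts, then hand the rendered markup to jsoup. This is only a sketch under several assumptions that are not in the thread: Selenium 4 with Chrome and a matching driver installed, a hypothetical class name RenderedPageScraper, an arbitrary 10-second timeout, and the selector div.bucket h4.mtb-title, which is taken from the HTML fragment in the question and may not match what the site serves today.

```java
import java.time.Duration;
import java.util.List;

import org.jsoup.Jsoup;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Hypothetical helper class, not part of the original answers.
public class RenderedPageScraper {

    // Parses already-rendered HTML with jsoup and returns the product titles.
    static List<String> productTitles(String renderedHtml) {
        return Jsoup.parse(renderedHtml)
                    .select("div.bucket h4.mtb-title")
                    .eachText();
    }

    public static void main(String[] args) {
        final String url = "http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx";
        // Assumption: Chrome and a compatible driver are available on this machine.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            // Wait up to 10 seconds (arbitrary) for the scripts to add a bucket.
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("div.bucket")));
            // Hand the browser's rendered markup to jsoup.
            for (String title : productTitles(driver.getPageSource())) {
                System.out.println(title);
            }
        } finally {
            driver.quit(); // always close the browser
        }
    }
}
```

Keeping the jsoup part in its own method means the selectors can be tried against a saved HTML string without starting a browser.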
Output of the first example:
<div class="controlcontent_r">
<div class="mtc-menu">
<ul class="mtc-cat">
<li class="mtc-block"><a class="mtc-a mtc-selected" title="Go To Digital Cameras" href="http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx">Digital Cameras</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go To Camcoders" href="http://www.jabraat.com/categories/Buy-Camcorders-Online/cid-CU00084380.aspx">Camcoders</a></li>
<li class="mtc-block1"><a class="mtc-a" title="Go To Camera Accessories" href="http://www.jabraat.com/categories/Buy-Camera-Accessories-Online/cid-CU00084381.aspx">Camera Accessories</a></li>
</ul>
</div>
</div>
<div class="controlcontent_r">
<div class="mtc-menu">
<ul class="mtc-cat">
<li class="mtc-block"><a class="mtc-a" title="Go To Camera" href="http://www.jabraat.com/categories/Buy-Cameras-Online/cid-CU00084376.aspx">Camera</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go To Digital Photo Frames" href="http://www.jabraat.com/categories/Buy-Digital-Photo-Frames-Online/cid-CU00084382.aspx">Digital Photo Frames</a></li>
<li class="mtc-block1"><a class="mtc-a" title="Go To Mobiles" href="http://www.jabraat.com/categories/Buy-Mobiles-Online/cid-CU00084383.aspx">Mobiles</a></li>
</ul>
</div>
</div>
<div class="controlcontent_r">
<div class="mtc-menu">
<ul class="mtc-cat">
<li class="mtc-block"><a class="mtc-a" title="Go to Watches" href="http://www.jabraat.com/categories/Buy-Watches-Online/cid-CU00084370.aspx">Watches</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Clothing" href="http://www.jabraat.com/categories/Buy-Online-Clothing/cid-CU00084420.aspx">Clothing</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Mobiles" href="http://www.jabraat.com/categories/Buy-Mobiles-Online/cid-CU00084383.aspx">Mobiles</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Cameras" href="http://www.jabraat.com/categories/Buy-Cameras-Online/cid-CU00084376.aspx">Cameras</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Home & Kitchen" href="http://www.jabraat.com/categories/Buy-Home-Kitchen-Appliances-Online/cid-CU00084391.aspx">Home & Kitchen</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Personal Care" href="http://www.jabraat.com/categories/Buy-Online-Personal-Care/cid-CU00084413.aspx">Personal Care</a></li>
<li class="mtc-block"><a class="mtc-a" title="Go to Jewellery" href="http://www.jabraat.com/categories/Buy-Online-Jewellery/cid-CU00084429.aspx">Jewellery</a></li>
<li class="mtc-block1"><a class="mtc-a" title="Go to Footwear" href="http://www.jabraat.com/categories/Buy-Online-Footwear/cid-CK00101771.aspx">Footwear</a></li>
</ul>
</div>
</div>
// To see the page exactly as jsoup receives it, print the whole document:
final String url = "http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx";
Document doc = Jsoup.connect(url).get(); // Connect and parse the document (as above)
System.out.println(doc); // Output the document (= how jsoup "sees" the website)
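
Printing the whole document works, but a narrower check may be easier to read: ask jsoup directly whether any bucket is present, and if so, read the price from it. The sketch below assumes the markup from the question (div.bucket and label.mtb-ofr) and uses a hypothetical class name BucketCheck. It runs against an inline HTML string rather than the live site, so it shows only the selector logic, not what the site currently returns.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original answers.
public class BucketCheck {

    // True if the markup jsoup received contains at least one product bucket.
    static boolean hasBuckets(Document doc) {
        return !doc.select("div.bucket").isEmpty();
    }

    // Reads the "Now At" price of one bucket as whole rupees.
    // Returns -1 if the label is missing; throws NumberFormatException
    // if the label text is not a plain number.
    static int offerPrice(Element bucket) {
        Element label = bucket.selectFirst("label.mtb-ofr");
        if (label == null) {
            return -1;
        }
        // ownText() skips the nested <b> and <span>, leaving e.g. "4,700".
        String digits = label.ownText().replace(",", "").trim();
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Inline sample based on the fragment in the question.
        String html = "<div class=\"bucket\">"
                + "<h4 class=\"mtb-title\">Nikon Coolpix L27 Point &amp; Shoot</h4>"
                + "<label class=\"mtb-ofr\"><b class=\"lb2\"> Now At </b>"
                + "<span class=\"WebRupee\">Rs. </span> 4,700 </label>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        if (hasBuckets(doc)) {
            Element bucket = doc.selectFirst("div.bucket");
            System.out.println(bucket.selectFirst("h4.mtb-title").text());
            System.out.println(offerPrice(bucket));
        } else {
            System.out.println("No div.bucket in the markup jsoup received");
        }
    }
}
```

If hasBuckets returns false for a document fetched with Jsoup.connect(url).get(), the buckets are not in the server's response and a script-executing tool is needed before jsoup can select them.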