Python group parsing (python, screen-scraping, beautifulsoup)


I'm trying to scrape some content (I'm very new to Python), but I've hit a stumbling block. The markup I'm trying to scrape is:

<h2><a href="/best-sellers/sj-b9822.html">Spear & Jackson Predator Universal Hardpoint Saw     - 22"</a></h2>
<p><span class="productlist_mostwanted_rrp">    
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>

<div class="clear"></div>

<p class="productlist_mostwanted_price">Now: £5.95</p>

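However, the only part of the price I'm after is 5.95. I've also had limited success getting the link text (Spear & Jackson) with the following:

Of course, this only returns the first result.

My end goal is for the results to look like this: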

Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95
etc
etc

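Since I want to export this to CSV, I need to work out how to put the data into two columns. As I said, I'm very new to Python, so I hope this makes sense.

Thanks for your help.

Many thanks.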

I think you're looking for something like this:

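from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2
import re

soup = BeautifulSoup(open('prueba.html').read())
# Collapse runs of whitespace in the link text into single spaces.
item = re.sub(r'\s+', ' ', soup.h2.a.text)
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
# Keep only the number, e.g. "5.95" from "Now: £5.95".
price = re.search(r'\d+\.\d+', price).group(0)

print item, price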

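Example output:

Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95

Note that for the item, the regular expression is only used to remove the extra whitespace, while for the price it is used to capture the number.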
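The two regular expressions described above can be tried on plain strings, with no HTML involved. The following is a minimal sketch for Python 3; the helper names clean_item and extract_price are made up for illustration and do not appear in the answer. It also shows that a pattern which requires a decimal point finds nothing in a whole-pound price such as "£16", whereas making the decimal part optional handles both forms.

```python
import re


def clean_item(text):
    """Collapse runs of whitespace into single spaces (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()


def extract_price(text, pattern=r"\d+(?:\.\d+)?"):
    """Return the first number in text as a string, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


item = clean_item('Spear & Jackson Predator Universal Hardpoint Saw     - 22"')
price = extract_price("Now: £5.95")
print(item, price)

# The stricter pattern insists on a decimal point, so "£16" yields no match.
print(extract_price("Now: £16", pattern=r"\d+\.\d+"))
# With the decimal part optional, whole-pound prices are captured too.
print(extract_price("Now: £16"))
```

Returning None instead of calling .group() directly avoids an AttributeError when a price string contains no number at all.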

html = '''
<h2><a href="/best-sellers/sj-b9822.html">Spear & Jackson Predator Universal Hardpoint Saw     - 22</a></h2>
<p><span class="productlist_mostwanted_rrp">    
Was: <span class="strikethrough">&pound;12.52</span></span><span class="productlist_mostwanted_save">Save: &pound;6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: &pound;5.95</p>
'''

from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(html)
desc = soup.h2.a.getText()
price_str = soup.find('p', {"class": "productlist_mostwanted_price" }).getText()
# [0-9.]+ grabs the first run of digits and dots, so "16" matches as well as "5.95".
price = float(re.search(r'[0-9.]+', price_str).group())

print desc, price



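Thanks! That grabs the first one. Would you mind showing me how to construct a loop to return all the results? This code only returns the first result, albeit perfectly.

That depends on where they are in the DOM (and whether they are on the same page).

Yes, they are on the same page. I have tried creating loops, but my efforts are laughable. Programming is still completely new to me!

Can anyone help me loop this so that it returns all the results rather than just the first? Thanks.

As I said, it depends on their position in the DOM. Please provide the missing information, or ask a new question as a follow-up to this one.

Thanks, but the price comes out wrong (it says 5.0), and there is no looping. Thanks though.

@PeterStannett Really? Running it through Python 2.7 gives

Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95

Did you have a cut-and-paste problem?

The problem was entirely me. Sorry! I re-ran it by copying and pasting rather than typing it in, and it works fine! Does [0-9] mean that a product priced at £16 would fail to print? Meanwhile, I'm working on getting the code to handle multiple items. I really appreciate your help.

Can anyone help me build a loop so that all the results on a page are returned? I've tried, but I'm struggling to get it working! Thanks.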
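One possible way to handle the loop and CSV questions raised in this thread is sketched below, for Python 3. It rests on assumptions the thread does not confirm: that each product on the page is an h2 link followed later by its own p element with class productlist_mostwanted_price, and that the second product in SAMPLE_HTML (invented for illustration) resembles the real page. To stay runnable without extra packages, it uses the standard library's html.parser instead of BeautifulSoup; with BeautifulSoup the usual route would be to loop over the result of findAll (find_all in the newer bs4 package) rather than calling find, which stops at the first match. ProductParser, extract_products and rows_to_csv are hypothetical names.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Sample markup: the first product comes from the question, the second is invented.
SAMPLE_HTML = '''
<h2><a href="/best-sellers/sj-b9822.html">Spear & Jackson Predator Universal Hardpoint Saw     - 22</a></h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">&pound;12.52</span></span><span class="productlist_mostwanted_save">Save: &pound;6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: &pound;5.95</p>
<h2><a href="/best-sellers/example.html">Example Hand Saw</a></h2>
<p class="productlist_mostwanted_price">Now: &pound;16</p>
'''


class ProductParser(HTMLParser):
    """Collect (name, price) pairs, assuming each name precedes its price."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_h2 = False
        self._target = None  # "name" or "price" while inside a tag of interest
        self._buffer = []
        self._name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._target, self._buffer = "name", []
        elif tag == "p" and attrs.get("class") == "productlist_mostwanted_price":
            self._target, self._buffer = "price", []

    def handle_data(self, data):
        # Only keep text while inside a product link or a price paragraph.
        if self._target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer)
        if tag == "h2":
            self._in_h2 = False
        elif tag == "a" and self._target == "name":
            self._name = re.sub(r"\s+", " ", text).strip()
            self._target = None
        elif tag == "p" and self._target == "price":
            # Optional decimal part, so whole-pound prices also match.
            match = re.search(r"\d+(?:\.\d+)?", text)
            if match and self._name is not None:
                self.rows.append((self._name, match.group(0)))
            self._name = None
            self._target = None


def extract_products(html):
    """Return every (name, price) pair found in the markup, in page order."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Render the pairs as two-column CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["item", "price"])
    writer.writerows(rows)
    return out.getvalue()


rows = extract_products(SAMPLE_HTML)
for name, price in rows:
    print(name, price)
print(rows_to_csv(rows))
```

To save a file instead of building a string, the same csv.writer calls can be pointed at a file opened with open(path, "w", newline=""), where path is whatever filename suits.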
