Python: extracting data with BeautifulSoup and re


I am trying to extract information from JB Hi-Fi. Here is what I have done:

from BeautifulSoup import BeautifulSoup
import urllib2
import re



url="http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Item0=soup.findAll('td',{'class':'check_title'})[0]    
print (Item0.renderContents())

The output is:

Apple iPod Classic 160GB (Black)Â 
<span class="SKU">MC297ZP/A</span>

I tried using re to remove the extra information:

 print(Item0.renderContents()).replace{^<span:,""}

Don't use .renderContents(); it is a debugging tool at best. Just take the first child:
>>> Item0.contents[0]
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t'
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)\xc2'

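BeautifulSoup has not guessed the encoding quite right, so the non-breaking space (U+00A0) appears as two separate bytes instead of one character. The detected encoding confirms the wrong guess: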
>>> soup.originalEncoding
'iso-8859-1'
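
The effect can be reproduced without touching the network. The following is a minimal Python 3 sketch (the session above is Python 2 with BeautifulSoup 3); the sample string is a hard-coded stand-in for the cell text, not data fetched from the site:

```python
# Stand-in for the cell text: the title followed by U+00A0 and trailing whitespace.
raw = "Apple iPod Classic 160GB (Black)\u00a0\r\n\t".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 splits U+00A0 (bytes C2 A0) into two characters.
wrong = raw.decode("iso-8859-1")
# Decoding with the correct charset keeps U+00A0 as a single character.
right = raw.decode("utf-8")

# strip() removes the trailing U+00A0 in both cases, but the wrong decode leaves U+00C2 behind.
print(repr(wrong.strip()))
print(repr(right.strip()))
```

The leftover character in the first result corresponds to the stray \xc2 seen after .strip() above.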

You can force the encoding by using the response headers; this server does set the character set:
>>> page.info().getparam('charset')
'utf-8'
>>> page=urllib2.urlopen(url)
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))
>>> Item0=soup.findAll('td',{'class':'check_title'})[0]
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)'

The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin-1, and the non-breaking space is now stripped correctly.
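
On Python 3 the same header lookup can be done with the standard library. This is a minimal sketch, assuming a hand-built header object in place of a real response (the headers of a urllib.request.urlopen response are the same kind of object); the Latin-1 fallback and the sample body are assumptions for illustration. In bs4 the corresponding constructor keyword is from_encoding.

```python
from email.message import Message

# Hand-built stand-in for the response headers (assumption: no network access here).
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

# Read the charset parameter from Content-Type; the Latin-1 fallback is an assumption.
charset = headers.get_content_charset() or "iso-8859-1"

# Stand-in for the raw response body, encoded the way the server declares it.
body = "Apple iPod Classic 160GB (Black)\u00a0\r\n".encode("utf-8")

# Decode with the declared charset so U+00A0 stays a single, strippable character.
text = body.decode(charset)
print(repr(text.strip()))
```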
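
For completeness, the replace attempt from the question can be written with re.sub. This is only a sketch in Python 3 on a hard-coded string that imitates the cell contents shown in the question; taking the first child as described above is generally more robust than running regular expressions over HTML.

```python
import re

# Hard-coded stand-in for the cell contents shown in the question.
cell_html = 'Apple iPod Classic 160GB (Black)\u00a0\r\n<span class="SKU">MC297ZP/A</span>'

# Drop the <span>...</span> element, then trim whitespace (strip() also removes U+00A0).
title = re.sub(r"<span\b[^>]*>.*?</span>", "", cell_html, flags=re.DOTALL).strip()
print(title)
```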