Python lxml etree.parse.xpath() returns items containing only tabs and newlines
For a typical eBay search results page, for example, I use lxml to extract the price of each result, like so:
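# Python 2 code (urllib2); on Python 3 the equivalent call is urllib.request.urlopen
import urllib2
from lxml import etree
url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Mizuno+Pants+Baseball&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
# Parse the response with lxml's lenient HTML parser
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# Select every text node directly inside each price span
xpathselector="//span[@class ='bold bidsold']/text()"
tree.xpath(xpathselector)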
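Although there are 50 search results (and therefore 50 prices), tree.xpath(xpathselector) returns a list of length 100. It contains all the prices, but also items that consist of nothing but newlines and tabs. (Ignore the difference between these prices and the ones shown on the web page; that is due to my geolocation.) Why is this?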
['\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
u' 1\xc2\xa0049.27',
'\n\t\t\t\t\t',
' 965.31',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 883.56',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 827.21',
'\n\t\t\t\t\t',
' 827.21',
'\n\t\t\t\t\t',
' 827.21',
'\n\t\t\t\t\t',
' 827.21',
'\n\t\t\t\t\t',
' 800.97',
'\n\t\t\t\t\t',
' 799.59',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 716.73',
'\n\t\t\t\t\t',
' 716.73',
'\n\t\t\t\t\t',
' 716.73',
'\n\t\t\t\t\t',
' 690.22',
'\n\t\t\t\t\t',
' 662.60',
'\n\t\t\t\t\t',
' 662.60',
'\n\t\t\t\t\t',
' 635.25',
'\n\t\t\t\t\t',
' 606.25',
'\n\t\t\t\t\t',
' 606.25',
'\n\t\t\t\t\t',
' 552.39',
'\n\t\t\t\t\t',
' 552.39',
'\n\t\t\t\t\t',
' 552.39',
'\n\t\t\t\t\t',
' 552.39',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 551.01',
'\n\t\t\t\t\t',
' 551.01',
'\n\t\t\t\t\t',
' 517.59',
'\n\t\t\t\t\t',
' 497.16',
'\n\t\t\t\t\t',
' 496.88',
'\n\t\t\t\t\t',
' 496.88',
'\n\t\t\t\t\t',
' 496.60',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 469.26',
'\n\t\t\t\t\t',
'\n\t\t\t\t\t\t\t\t',
'\n\t\t\t\t\t',
' 468.15',
'\n\t\t\t\t\t',
' 414.30',
'\n\t\t\t\t\t',
' 414.02',
'\n\t\t\t\t\t',
' 414.02',
'\n\t\t\t\t\t',
' 414.02',
'\n\t\t\t\t\t',
' 414.02',
'\n\t\t\t\t\t',
' 386.68']
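Newlines and other whitespace located directly inside the target span are text nodes too, so they are matched by the span[…]/text() part of your XPath. You can filter out these whitespace-only text nodes by putting the XPath normalize-space() function in a predicate. It returns an empty string for such a node, and an empty string evaluates to false in a predicate: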
xpathselector="//span[@class ='bold bidsold']/text()[normalize-space()]"
Output:
['506,533.33', '506,000.00', '466,000.00', '399,333.33', '399,333.33', '399,333.33', '399,333.33', '399,333.33', '386,666.67', '386,000.00', '346,000.00', '346,000.00', '346,000.00', '333,200.00', '333,200.00', '333,066.67', '319,866.67', '319,866.67', '306,666.67', '293,066.67', '292,666.67', '292,666.67', '266,666.67', '266,666.67', '266,666.67','266666.67','266533.33','266533.33','266533.33','266000.00','266000.00','253200.00','249866.67','240000.00','239866.67','239866.67','239866.67','239866.67','239866.67','239863.33','226533.33']
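To see the effect in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live eBay page. The markup in SAMPLE_HTML is made up for illustration and is only an assumption about the page: each span holds a child element, so the indentation before it and the price after it become two separate text nodes. The real page's structure may differ. The sketch is written for Python 3.

```python
from lxml import etree

# Hypothetical markup (an assumption, not copied from eBay): the whitespace
# before <b> and the price after </b> are separate text nodes of each span.
SAMPLE_HTML = (
    '<html><body>'
    '<span class="bold bidsold">\n\t\t\t\t\t<b>$</b> 965.31</span>'
    '<span class="bold bidsold">\n\t\t\t\t\t<b>$</b> 883.56</span>'
    '</body></html>'
)

root = etree.HTML(SAMPLE_HTML)

base = "//span[@class='bold bidsold']/text()"

# Every text node directly inside the spans, whitespace-only ones included.
all_text = root.xpath(base)

# The predicate drops nodes whose normalized string value is empty.
non_blank = root.xpath(base + "[normalize-space()]")

# The predicate only filters; trimming still happens on the Python side.
prices = [t.strip() for t in non_blank]

print(len(all_text), len(non_blank))
print(prices)
```

Note that the predicate only decides which text nodes are kept. It does not trim them, so a strip() call is still needed if the leading space in each price is unwanted.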
I didn't know about the normalize-space() function. Thanks.