Scrapy XPath following-sibling of strong: wrong result
I am using Scrapy to extract parts of an address and need help with the XPath syntax for it. The code is below (apologies if it is not valid code; I was not sure how to paste it into the question properly).

The line
item['address1'] = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]')
returns the string value [], with the last few characters clipped.

When I add .extract(), the value shows in cmd as
[u'\n\t\t\t 123 address street, someplace, somewhere']
but it does not appear in the output sheet.

I have searched for a solution and tried .select('text()').extract(), but that is not right either.

Any help would be appreciated, as always.
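Advice on how to include page source in a question on this forum would also be much appreciated. Thanks.

Using your sample URL, I suggest selecting the divs with class "result", as in the parse method further down that starts from id("search-results")/div[@class="result"].

Hope it helps. By the way, I suggest changing items = [] to items_list = [] or something similar, because items is a name Scrapy projects already use (for example the project's items module) and it could cause a clash later.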
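site.select('strong[text()="Physical Address"]/following-sibling::text()[1]').extract() should work fine, or use .extract()[0].strip() to get the one and only text node without the leading whitespace and line break, at least when testing locally with your HTML sample (using sites = hxs.select('//*[@class="result"]')).

I added .extract()[0].strip() and got this exception: IndexError: list index out of range. Any ideas?

This means the target strong element has no sibling text node. Please provide a full HTML sample of the input. Perhaps [text()="Physical Address"] is too strict, and [contains(., "Physical Address")] might be more forgiving.

I pasted this into my spider and it threw SyntaxError: invalid syntax on results = hxs.select('id("search-results")/div[@class="result"]'): with the colon as the offending character. Really confusing. Any other ideas? Thanks.

I removed the colon and ran it, which produced the same result: either the value has the full XPath attached (and the text is truncated), or the values show in cmd but not in the output file. For additional information, in cmd it looks like this: {'address1': [u'\n\t\t\t 100 Wellington Street, Freemans Bay, Auckland 1011'],

My bad, the ":" is there because I tested with for result in hxs.select('id("search-results")/div[@class="result"]'):

Well, .extract() always returns a list, and that list can be empty. If you only need the first element of the list, add [0] after each .extract(). You may also want to strip the leading whitespace, so you can do item['address1'] = map(unicode.strip, result.select('strong[text()="Physical Address"]/following-sibling::text()[1]').extract())[0]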
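The difference between the strict predicate [text()="Physical Address"] and the more forgiving [contains(., "Physical Address")] mentioned in the comments can be illustrated without Scrapy. This is only a sketch, assuming Python 3 and the standard library's xml.etree.ElementTree as a stand-in. ElementTree's limited XPath supports neither text() nodes nor contains(), so both predicates are emulated by the hypothetical helpers exact_text_match and contains_match.

```python
import xml.etree.ElementTree as ET


def exact_text_match(elem, label):
    """Emulate [text()="label"]: some direct text node equals label exactly."""
    # Direct text nodes of elem: its leading text plus the tail of each child.
    nodes = [elem.text] + [child.tail for child in elem]
    return any(node == label for node in nodes if node is not None)


def contains_match(elem, label):
    """Emulate [contains(., "label")]: label is a substring of all text inside elem."""
    return label in "".join(elem.itertext())


# Hypothetical inputs: one clean label, one with padding and a trailing colon.
clean = ET.fromstring("<strong>Physical Address</strong>")
padded = ET.fromstring("<strong> Physical Address: </strong>")

print(exact_text_match(clean, "Physical Address"))
print(exact_text_match(padded, "Physical Address"))
print(contains_match(padded, "Physical Address"))
```

If the real page pads the label with whitespace or punctuation, only the contains-style check still matches, which would be one way to end up with an empty result list from the strict XPath.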
<div class="result">
<h3>
<a href="/provider/service/xxxxx/">service name</a>
</h3>
<p>
"blah blah"
</p>
<strong>Physical Address</strong>
"123 address street, someplace, somewhere"
<br/>
<strong>Postcode</strong>
"xxx"
<br/>
<strong>District/town</strong>
"someplace"
<br/>
<strong>Region</strong>
"someplace bigger"
<br/>
<strong>Phone</strong>
"xx xxx xxxx"
<br/><strong>Fax Number</strong>
"xx xxx xxxx"
<br/>
<!--strong>Email</strong-->
<a href="#" onclick="window.location=('mail'+'to:'+'xxxxx'+''+'@'+'xxxx.xx.xx'+''); return false;">
"xxxxx"
<strong></strong>
"xxxxx.xx.xx"
</a>
<a rel="nofollow" class="printlist-add" href="/provider/print-list/add/xxxx/">Add to print list</a>
</div>
<hr/>
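The structure of the sample above, where each value is a bare text node directly after its strong label, can be explored with the standard library alone. The following is a sketch, assuming Python 3 and a trimmed, well-formed copy of the sample (the SAMPLE string and the labelled_fields helper are hypothetical, not part of Scrapy). In ElementTree the text that follows an element's closing tag is stored in its .tail attribute, which plays the role of following-sibling::text()[1] here.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the HTML sample (an assumption for this sketch).
SAMPLE = """
<div class="result">
  <h3><a href="/provider/service/xxxxx/">service name</a></h3>
  <strong>Physical Address</strong>
      123 address street, someplace, somewhere
  <br/>
  <strong>Postcode</strong>
      xxx
  <br/>
</div>
"""


def labelled_fields(div):
    """Map each <strong> label to the text node that directly follows it."""
    fields = {}
    for strong in div.findall("strong"):
        label = (strong.text or "").strip()
        # .tail is the text between </strong> and the next element.
        value = (strong.tail or "").strip()
        if label:
            fields[label] = value
    return fields


div = ET.fromstring(SAMPLE.strip())
fields = labelled_fields(div)
print(fields["Physical Address"])
```

ElementTree only accepts well-formed XML, so real pages would normally still go through Scrapy's own selectors; the point of the sketch is only to show where the address text sits relative to its label.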
# Spider from the question (Python 2, legacy HtmlXPathSelector API)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from test.items import WebhealthItem


class NewSpider(BaseSpider):
    name = "my_spider"
    download_delay = 2
    allowed_domains = ["website.com"]
    start_urls = [
        "http://website.com/site1"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search-results"]/div')
        items = []
        for site in sites:
            item = WebhealthItem()
            item['practice'] = site.select('h3/a/text()').extract()
            item['url'] = site.select('h3/a/@href').extract()
            # No .extract() here, so a selector list is stored instead of the text
            item['address1'] = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]')
            items.append(item)
        return items
# parse() suggested in the answer (Python 2, legacy HtmlXPathSelector API)
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('id("search-results")/div[@class="result"]')
    items = []
    for result in results:
        item = WebhealthItem()
        # .extract() returns a list; [0] takes its first element
        item['practice'] = result.select('h3/a/text()').extract()[0]
        item['url'] = result.select('h3/a/@href').extract()[0]
        item['address1'] = map(
            unicode.strip,
            result.select('strong[text() = "Physical Address"]/following-sibling::text()[1]').extract()
        )[0]
        items.append(item)
    return items
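The [0] indexing in the method above raises IndexError whenever an XPath matches nothing, and map(...)[0] only works on Python 2, because map returns an iterator on Python 3. A more defensive variant might look like the sketch below, assuming Python 3; first_stripped is a hypothetical helper name and not part of Scrapy.

```python
def first_stripped(values, default=""):
    """Return the first string without surrounding whitespace,
    or default when the list is empty."""
    for value in values:
        return value.strip()
    return default


# Lists shaped like the output of .extract() in the question.
found = first_stripped(["\n\t\t\t 123 address street, someplace, somewhere"])
missing = first_stripped([])

print(repr(found))
print(repr(missing))
```

In the spider this would be used as item['address1'] = first_stripped(result.select(...).extract()), so a result block without a "Physical Address" label yields an empty string instead of an exception.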
# Helper and parse() from another answer
def caiqinghua_array_string_strip(array_string):
    # Return '' for an empty result list, otherwise the cleaned first string
    if not array_string:
        return ''
    # Drop embedded line breaks, then trim leading and trailing whitespace
    string = array_string[0].replace('\r\n', '')
    return string.strip()


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search-results"]/div')
    items = []
    for site in sites:
        item = WebhealthItem()
        item['practice'] = site.select('h3/a/text()').extract()
        item['url'] = site.select('h3/a/@href').extract()
        # .extract() turns the selector list into the list of strings the helper expects
        address = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]').extract()
        item['address1'] = caiqinghua_array_string_strip(address)
        items.append(item)
    return items
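One possible reason a value can be visible in cmd yet look missing in an exported sheet is the leading line break in the extracted string. This is only a guess about the question's setup, assuming the output is a CSV file opened in a spreadsheet. The sketch below uses Python 3's csv module directly rather than Scrapy's exporter, and to_csv is a hypothetical helper.

```python
import csv
import io

# Value shaped like the one printed in the question.
raw = "\n\t\t\t 123 address street, someplace, somewhere"


def to_csv(value):
    """Write a one-column CSV with a header row and a single value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["address1"])
    writer.writerow([value])
    return buffer.getvalue()


raw_lines = to_csv(raw).splitlines()
clean_lines = to_csv(raw.strip()).splitlines()

# The unstripped cell starts with a line break, so its first physical
# line in the file holds nothing but the opening quote character.
print(raw_lines)
print(clean_lines)
```

With the unstripped value, the cell spans two physical lines and the first one is empty apart from the quote, which a spreadsheet may display as a blank-looking cell; the stripped value stays on one line.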