Python Scrapy: one item per row
I'm using Scrapy to scrape category pages for products on architonic*com. I would like to export these products to CSV, one product per row. At the moment, all the brand names from a given category page end up listed together under 'brand', whereas I would like output like the following:
{'brand': [u'Elisabeth Ellefsen'],
'title': [u'Up chair I 907'],
'img_url': [u'http://image.architonic.com/img_pro1-1/117/4373/t-up-06f-sq.jpg'],
'link': [u'http://www.architonic.com/pmsht/up-chair-tonon/1174373']
}
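I tried playing with Item Loaders (adding default_output_processor = TakeFirst()), tried adding 'yield item' (see the commented-out code), and searched for two days for a solution, without luck. I hope someone is willing to help me. Any help is much appreciated.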
My current output looks like this:
2013-01-14 11:53:23+0100 [archi] DEBUG: Scraped from <200 http://www.architonic.com/pmpro/home-furnishings/3210002/2/2/3>
{'brand': [u'Softline',
u'Elisabeth Ellefsen',
u'Sellex',
u'Lievore Altherr Molina',
u'Poliform',
.....
u'Hans Thyge & Co.'],
'img_url': [u'http://image.architonic.com/img_pro1-1/117/3661/terra-h-sq.jpg',
u'http://image.architonic.com/img_pro1-1/117/0852/fly-01-sq.jpg',
u'http://image.architonic.com/img_pro1-1/116/9870/ley-0004-sq.jpg',
u'http://image.architonic.com/img_pro1-1/117/1023/arflex-hollywood-03-sq.jpg',
...
u'http://image.architonic.com/img_pro1-1/118/5357/reef-002-sq.jpg'],
'link': [u'http://www.architonic.com/pmsht/terra-softline/1173661',
u'http://www.architonic.com/pmsht/fly-sellex/1170852',
u'http://www.architonic.com/pmsht/ley-poliform/1169870',
.....
u'http://www.architonic.com/pmsht/reef-collection-labofa/1185357'],
'title': [u'Terra',
u'Fly',
u'Ley chair',
.....
u'Hollywood Sofa',
u'Pouff Round']}
The spider:

import string
import re
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_entities
from archiscraper.items import ArchiItemFields, ArchiLoader

class ArchiScraper(BaseSpider):
    name = "archi"
    allowed_domains = ["architonic.com"]
    start_urls = ['http://www.architonic.com/pmpro/home-furnishings/3210002/2/2/%s' % page for page in xrange(1, 4)]
    # rules = (Rule(SgmlLinkExtractor(allow=('.', ),restrict_xpaths=('//*[@id="right_arrow"]',))
    #         , callback="parse_items", follow= True),
    #         )
    #
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//li[contains(@class, "nav_pro_item")]')
        items = []
        for site in sites:
            item = ArchiLoader(ArchiItemFields(), site)
            item.add_xpath('brand', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[1][self::text()]')
            item.add_xpath('designer', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[3][self::text()]')
            item.add_xpath('title', '//*[contains(@class, "nav_pro_text")]/a/strong/text()')
            item.add_xpath('img_url', '//li[contains(@class, "nav_pro_item")]/div/a/img/@src[1]')
            item.add_xpath('link', '//*[contains(@class, "nav_pro_text")]/a/@href')
            items.append(item.load_item())
            return items
        # for item in items:
        #     yield item
items.py:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
import string
from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose, Join, TakeFirst
from scrapy.utils.markup import remove_entities
from scrapy.contrib.loader import XPathItemLoader

class ArchiItem():
    pass

class ArchiItemFields(Item):
    brand = Field()
    title = Field()
    designer = Field()
    img_url = Field()
    img = Field()
    link = Field()
    pass

class ArchiLoader(XPathItemLoader):
    # default_input_processor = MapCompose(unicode.strip)
    # default_output_processor= TakeFirst()
    brand_out = MapCompose(unicode.strip)
    # title_out = Join()
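A note on the TakeFirst attempt mentioned in the question: as far as I know, Scrapy's TakeFirst output processor returns the first collected value that is neither None nor an empty string. The sketch below is a rough pure-Python imitation, not the Scrapy class itself; take_first and brands_on_page are hypothetical names, and it is written for Python 3 rather than the Python 2 used in the question. It suggests why that processor alone would collapse a page-wide list to a single brand instead of spreading the brands over separate rows.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Hypothetical sample: every brand collected from one category page.
brands_on_page = ["Softline", "Elisabeth Ellefsen", "Sellex"]

# Only the first brand survives; the others are discarded, not moved to new rows.
print(take_first(brands_on_page))
```

So an output processor only changes how one field of one item is reduced; it does not change how many items the spider emits.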
Just return your list of items after the for loop ends, i.e.:
for site in sites:
    item = ArchiLoader(ArchiItemFields(), site)
    item.add_xpath('brand', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[1][self::text()]')
    item.add_xpath('designer', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[3][self::text()]')
    item.add_xpath('title', '//*[contains(@class, "nav_pro_text")]/a/strong/text()')
    item.add_xpath('img_url', '//li[contains(@class, "nav_pro_item")]/div/a/img/@src[1]')
    item.add_xpath('link', '//*[contains(@class, "nav_pro_text")]/a/@href')
    items.append(item.load_item())
return items
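To see why the position of the return statement matters, here is a minimal sketch that does not need Scrapy at all. The two functions are hypothetical stand-ins for parse(), and plain strings stand in for the selected page elements.

```python
def parse_return_inside_loop(sites):
    items = []
    for site in sites:
        items.append({"title": site})
        return items  # runs during the first iteration, so the loop stops here
    return items  # only reached when sites is empty


def parse_return_after_loop(sites):
    items = []
    for site in sites:
        items.append({"title": site})
    return items  # runs once, after every site has been processed


sites = ["Terra", "Fly", "Ley chair"]
print(len(parse_return_inside_loop(sites)))  # 1
print(len(parse_return_after_loop(sites)))   # 3
```

With the return inside the loop, each page yields at most one item, however many products it lists.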
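Hope it helps :)

Please include the items.py file; without it this code cannot be run. :)
Thanks, I added items.py! :)
CSV? But your data structure is JSON. What's going on there?
If you want a list of separate items, you should first get the item container and then extract the data you need from it.
That is Scrapy's output. Don't worry about it: I defined a custom feed exporter/pipeline in settings.py to export to CSV. Is there any solution to my problem?
Did this answer help, @Joost?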
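The suggestion to get the item container first and then extract data from it can be sketched with the standard library alone. In XPath, an expression that starts with // is evaluated from the document root even when it is applied to a sub-selector, so it matches every product on the page; prefixing it with a dot (.//) keeps the search inside the current container. In the spider that would presumably mean writing './/*[contains(@class, "nav_pro_text")]/...' instead of '//*[contains(@class, "nav_pro_text")]/...', though I have not tested this against the real page. Below, xml.etree.ElementTree (which supports only a subset of XPath) stands in for Scrapy's selectors, the markup in PAGE is a hypothetical simplification, and the code targets Python 3.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is more complex.
PAGE = """
<ul>
  <li class="nav_pro_item"><a href="/pmsht/terra"><strong>Terra</strong></a></li>
  <li class="nav_pro_item"><a href="/pmsht/fly"><strong>Fly</strong></a></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)
containers = root.findall(".//li[@class='nav_pro_item']")

# Document-wide query: every title on the page, whichever product is current.
all_titles = [el.text for el in root.findall(".//strong")]

# Container-relative queries: exactly one title and one link per product.
rows = []
for li in containers:
    rows.append({
        "title": li.find(".//strong").text,
        "link": li.find(".//a").get("href"),
    })

# Write one CSV row per product.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "link"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The document-wide query corresponds to the lists seen in the question's output, while the container-relative loop gives the one-product-per-row shape that a CSV export needs.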