Web scraping: Scrapy spider only returns the last item in the list
I'm building a scraper to crawl a page and return multiple items (h3 & p tags) from a div. For some reason the scraper prints every "name" field when called, but only saves the information for the last item on the page. Here is my code:
import scrapy

class FoodSpider(scrapy.Spider):
    name = 'food'
    allowed_domains = ['https://blog.feedspot.com/food_blogs/']
    start_urls = ['https://blog.feedspot.com/food_blogs/']

    def parse(self, response):
        blogs = response.xpath("//div[@class='fsb v4']")
        for blog in blogs:
            names = blog.xpath('.//h3/a[@class="tlink"]/text()').extract()
            links = blog.xpath('.//p/a[@class="ext"]/@href').extract()
            locations = blog.xpath('.//p/span[@class="location"]/text()').extract()
            abouts = blog.xpath('.//p[@class="trow trow-wrap"]/text()[4]').extract()
            post_freqs = blog.xpath('.//p[@class="trow trow-wrap"]/text()[6]').extract()
            networks = blog.xpath('.//p[@class="trow trow-wrap"]/text()[9]').extract()
            for name in names:
                name.split(',')
                # print(name)
            for link in links:
                link.split(',')
            for location in locations:
                location.split(',')
            for about in abouts:
                about.split(',')
            for post_freq in post_freqs:
                post_freq.split(',')
            for network in networks:
                network.split(',')
            yield {'name': name,
                   'link': link,
                   'location': location,
                   'about': about,
                   'post_freq': post_freq,
                   'network': network
                   }
Does anyone know what I'm doing wrong?

If you run

//div[@class='fsb v4']

in DevTools, it returns only a single element, so your outer loop body executes just once. You have to find a selector that matches each of those profile divs individually:
class FoodSpider(scrapy.Spider):
    name = 'food'
    allowed_domains = ['https://blog.feedspot.com/food_blogs/']
    start_urls = ['https://blog.feedspot.com/food_blogs/']

    def parse(self, response):
        for blog in response.css("p.trow.trow-wrap"):
            yield {'name': blog.css(".thumb.alignnone::attr(alt)").extract_first(),
                   'link': "https://www.feedspot.com/?followfeedid=%s" % blog.css("::attr(data)").extract_first(),
                   'location': blog.css(".location::text").extract_first(),
                   }
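There is also a second bug worth noting: in the original spider, each of the six `for` loops runs to completion before the single `yield`, so every loop variable (`name`, `link`, etc.) is left bound to the last element of its list, and only that last item gets saved. If you wanted to keep the original per-field `.extract()` lists, you could pair them with `zip` and yield one item per tuple. A minimal sketch of both the bug and the fix, using hypothetical sample lists in place of the `.extract()` results (assumed to be aligned, which the real page may not guarantee):

```python
# Hypothetical stand-ins for names = blog.xpath(...).extract(), etc.
names = ["Blog A", "Blog B", "Blog C"]
links = ["http://a", "http://b", "http://c"]

# Buggy pattern: the loops finish first, so the variables keep only
# the LAST values when the dict is finally built.
for name in names:
    pass
for link in links:
    pass
last_only = {"name": name, "link": link}  # always the last item

# Fix: iterate the parallel lists together and build one item per pair.
# In a spider you would `yield item` inside this loop instead.
items = [{"name": n, "link": l} for n, l in zip(names, links)]
```

That said, the answer's approach of looping over one selector per profile (`response.css("p.trow.trow-wrap")`) is more robust than zipping parallel lists, since a missing field in one profile can't shift all the later values out of alignment.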