Python empty CSV file
I'm trying to run my Scrapy spider. It returns no errors, but it outputs an empty CSV file. I'm launching the spider from the command line with scrapy crawl AnimeReviews -o AnimeReviews.csv -t csv. These are the libraries I used:
import scrapy
import json
from functools import reduce
from scrapy.selector import Selector
from AnimeReviews.items import AnimereviewsItem
last_page = 1789
This is my spider:
class AnimeReviewsSpider(scrapy.Spider):
    name = 'AnimeReviews_spider'
    allowed_urls = ['myanimelist.net']
    start_urls = ['https://myanimelist.net/reviews.php?t=anime']

    def parse(self, response):
        page_urls = [response.url + "&p=" + str(pageNumber) for pageNumber in range(1, last_page + 1)]
        for page_url in page_urls:
            yield scrapy.Request(page_url,
                                 callback=self.parse_reviews_page)

    def parse_reviews_page(self, response):
        item = AnimereviewsItem()
        reviews = response.xpath('//*[@class="borderDark pt4 pb8 pl4 pr4 mb8"]').extract()  # each page displays 50 reviews
        for review in reviews:
            anime_title = Selector(text=review).xpath('//div[1]/a[1]/strong/text()').extract()
            anime_url = Selector(text=review).xpath('//a[@class="hoverinfo_trigger"]/@href').extract()
            anime_url = map(lambda x: 'https://myanimelist.net' + x, anime_url)
            review_time = Selector(text=review).xpath('//*[@style="float: right;"]/text()').extract()[0]
            reviewer_name = Selector(text=review).xpath('//div[2]/table/tr/td[2]/a/text()').extract()
            rating = Selector(text=review).xpath('//div[2]/table/tr/td[3]/div[2]/text()').extract()
            for i in range(len(rating)):
                rating_temp = rating[i]
                rating[i] = rating_temp.split(" ")[1]
            review_text = Selector(text=review).xpath('//*[@class="spaceit textReadability word-break"]').extract()
            for i in range(len(review_text)):
                text = Selector(text=review_text[i]).xpath('//text()').extract()
            pic_url = Selector(text=review).xpath('//div[3]/div[1]/div[1]/a/img/@data-src').extract()
            item['anime_title'] = anime_title
            item['anime_url'] = anime_url
            item['review_time'] = review_time
            item['reviewer'] = reviewer_name
            item['rating'] = rating
            item['review_text'] = review_text
            item['pic_url'] = pic_url
            yield item
This is the log after the crawl:
2018-06-22 13:37:14 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-22 13:37:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 698849,
'downloader/request_count': 1791,
'downloader/request_method_count/GET': 1791,
'downloader/response_bytes': 148209070,
'downloader/response_count': 1791,
'downloader/response_status_count/200': 1791,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 22, 11, 37, 14, 546133),
'log_count/DEBUG': 1792,
'log_count/INFO': 13,
'request_depth_max': 1,
'response_received_count': 1791,
'scheduler/dequeued': 1790,
'scheduler/dequeued/memory': 1790,
'scheduler/enqueued': 1790,
'scheduler/enqueued/memory': 1790,
'start_time': datetime.datetime(2018, 6, 22, 11, 30, 38, 403920)}
2018-06-22 13:37:14 [scrapy.core.engine] INFO: Spider closed (finished)
Let me know if you need more information.

The biggest problem here is the XPath expressions. They look auto-generated, and they are far too specific. For example, even the XPath for the reviews doesn't match anything, so no items are ever scraped and the CSV stays empty. Something as simple as //div[@class="borderDark"] matches all 50 reviews on a page, and so does the CSS expression .borderDark.

I suggest you get familiar with XPath and/or CSS selectors and write the selectors by hand. You can check a candidate selector interactively before putting it in the spider, as shown below.
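For example, a quick check in scrapy shell (the expected count of 50 comes from the spider's own comment that each page displays 50 reviews):

scrapy shell 'https://myanimelist.net/reviews.php?t=anime'
>>> len(response.css('.borderDark'))
50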
Also, you're converting selectors to text (with .extract()) and then back to selectors (with Selector(text=...)). There's no need for that; just keep working with the selectors that .xpath() returns and chain further queries off them.
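To illustrate both points, here is a minimal sketch of what parse_reviews_page could look like with a hand-written container selector and chained, relative queries. The .borderDark class comes from the paragraph above and hoverinfo_trigger from your original spider; I've left the remaining field selectors out, since those need to be worked out by hand against the live page:

def parse_reviews_page(self, response):
    # .borderDark matches each of the 50 review boxes on the page
    for review in response.css('.borderDark'):
        item = AnimereviewsItem()
        # a relative XPath (note the leading ".") stays scoped to this review,
        # instead of matching against the whole page
        item['anime_title'] = review.xpath('.//a[@class="hoverinfo_trigger"]/strong/text()').extract_first()
        url = review.xpath('.//a[@class="hoverinfo_trigger"]/@href').extract_first()
        # urljoin replaces the manual 'https://myanimelist.net' + x concatenation
        item['anime_url'] = response.urljoin(url) if url else None
        # ...fill in the remaining fields the same way, with hand-written
        # relative selectors instead of the auto-generated absolute ones
        yield item

Because every field query is relative to a single review box, a fresh item is built per review and the fields can't get mixed up across reviews, so the yielded items should show up in the CSV.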