Python: How can I make Scrapy follow a URL generated by JavaScript?


python selenium web-crawler scrapy


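I want to scrape the news on this website:

news.scut.edu.cn

On its sub-sites, however, the URL behind the next-page link (下一页 in Chinese) in the bottom-right corner is generated by JavaScript. The link's HTML source refers to this script: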

var _currentPageIndex =346;
var _listArticleCount =-1;       
var _listPaginationCount =-1; 
function a_next(url) {           
if(_currentPageIndex > 1) {               
location.href =url.replace('i/','i/'+(_currentPageIndex-1));
}                
}
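
For reference, the script only rewrites the URL it is given: it inserts the previous page index after the first `i/`. Below is a minimal Python sketch of the same logic. The example argument is a hypothetical placeholder, since the actual value passed to `a_next` is not shown here; it was chosen so that the result has the `/i/<n>/list.htm` form used further down.

```python
def a_next(url, current_page_index):
    """Python port of the page's a_next(): return the URL it would navigate to.

    Returns None when the JavaScript version would do nothing (index <= 1).
    """
    if current_page_index > 1:
        # JavaScript's String.replace with a string pattern replaces only the
        # first match, hence count=1 here.
        return url.replace('i/', 'i/' + str(current_page_index - 1), 1)
    return None


# Hypothetical argument; the real value comes from the link's HTML.
example = a_next('/s/22/t/3/p/1/i//list.htm', 346)
print(example)
```

Because the function is a plain string operation, it can be called from a spider callback once the real argument and `_currentPageIndex` have been extracted from the page.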

I want to scrape every page, so the spider needs to follow the next-page link. Here is my spider code:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scutnews.items import ScutnewsItem


class NewsSpider(CrawlSpider):
    name = "scutnews"
    allowed_domains = ["news.scut.edu.cn"]
    start_urls = ["http://news.scut.edu.cn"]

    rules = (
        # Follow list pages; extract an item from every article ("info") page.
        Rule(LinkExtractor(allow=(r"http://news.scut.edu.cn/s/22/t/.+/list.*"))),
        Rule(LinkExtractor(allow=(r"http://news.scut.edu.cn/s/22/t/.+/info.*")), callback="parse_item"),
    )

    def start_requests(self):
        # Send a browser-like User-Agent with the initial request.
        yield scrapy.Request("http://news.scut.edu.cn", headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0'})

    def parse_item(self, response):
        item = ScutnewsItem()
        # item['title'] = response.xpath('//div[@class="display_news_con"]/h1/text()').extract()
        # item['time'] = response.xpath('//span[@class="posttime"]/text()').extract()
        item['content'] = response.xpath('//div[@class="infobox"]/div[1]/p/text()|//div[@class="infobox"]/div[1]/p/span/text()|//div[@class="infobox"]/div[1]/p/span/span/text()|//div[@class="infobox"]/div[1]/p/span/span/span/text()|//div[@class="infobox"]/div[1]/text()').extract()
        # item['url'] = response.url
        return item

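I noticed that the URL of the current page and the URL of the next page differ by only one number. I know there are a few possible solutions: emulating the JavaScript logic, or using libraries such as Selenium and PhantomJS. How can I fix my Scrapy spider by emulating the JS logic so that it reaches the next page? Do I need to change the spider's rules? And how would this be done with Selenium or PhantomJS?

Thanks in advance.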

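I would like to suggest an approach that does not render the JavaScript, but instead extracts the JavaScript information from the page.

You can add a callback for the list pages in the rules: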

rules = (
    Rule(LinkExtractor(allow=(r"http://news.scut.edu.cn/s/22/t/.+/list.*")), callback = "parse_list"),
    Rule(LinkExtractor(allow=(r"http://news.scut.edu.cn/s/22/t/.+/info.*")), callback = "parse_item")
)

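Then, in the callback, use a regular expression to parse the JavaScript and get the total number of pages in the list.

If a page number is found, you can create the links for all page numbers in a loop (down to the first page) and pass these requests to the crawler.

The code below may not work as-is, but it can serve as a starting point.

Hint: the scrapyjs middleware (since renamed scrapy-splash) can be very useful here, without having to use a real browser.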

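def parse_list(self, response):
    # Paginated URLs already contain '/i/<n>/'; only expand from the first list page.
    if '/i/' in response.url:
        return
    # The index lives in an inline script: var _currentPageIndex =346;
    xpath_page_counter = './/script[@language="javascript" and contains(.,"currentPageIndex")]'
    page_counter = response.xpath(xpath_page_counter).re(r'currentPageIndex =(\d+);')
    if page_counter:
        # a_next() goes to index - 1, so request every page down to the first one.
        for page_number in range(int(page_counter[0]) - 1, 0, -1):
            page_url = response.url.replace('/list.htm', '/i/%d/list.htm' % page_number)
            self.logger.debug('%s -> page %d: %s', response.url, page_number, page_url)
            # No callback: the CrawlSpider applies its rules to the response.
            yield scrapy.Request(page_url)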
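
The regular expression and the URL construction can be checked without running a crawl. The following is a minimal sketch using only the standard library. The HTML fragment and the list URL are made-up stand-ins for a real response, and `extract_page_index` and `build_page_urls` are hypothetical helper names, not part of Scrapy.

```python
import re

# Same pattern as in parse_list, applied to raw HTML instead of a selector.
PAGE_INDEX_RE = re.compile(r'currentPageIndex =(\d+);')


def extract_page_index(html):
    """Return the _currentPageIndex value embedded in the page, or None."""
    match = PAGE_INDEX_RE.search(html)
    return int(match.group(1)) if match else None


def build_page_urls(list_url, current_index):
    """Build the URLs for pages current_index - 1 down to 1."""
    return [
        list_url.replace('/list.htm', '/i/%d/list.htm' % page_number)
        for page_number in range(current_index - 1, 0, -1)
    ]


# Made-up sample input, modelled on the script quoted in the question.
sample_html = '<script language="javascript">var _currentPageIndex =4;</script>'
sample_url = 'http://news.scut.edu.cn/s/22/t/3/list.htm'  # hypothetical URL

index = extract_page_index(sample_html)
for url in build_page_urls(sample_url, index):
    print(url)
```

Keeping the two steps as plain functions makes it easy to confirm the pattern against a saved copy of a list page before wiring it into the spider.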