Following URLs in JavaScript - Scrapy Splash

I am very new to web scraping. I managed to extract information from static websites, but now I am trying to follow URLs and extract data from a site that involves some JavaScript. I have installed scrapy-splash and it runs very well. On the website I am trying to crawl (the start URL below), the button in the top-right corner takes you to the next page (this is JavaScript, hence Splash). I want to scrape some basic data (company name, sectors, etc.) from all of the pages, through to the last one. This is what I have so far, and I need help correcting it so that it executes successfully:
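(The question does not show the project settings. For reference, a typical scrapy-splash wiring in settings.py, assuming a Splash instance running locally on port 8050, looks roughly like this; the middleware names and priorities follow the scrapy-splash README.)

# settings.py -- assumed setup, not shown in the question
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'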


import scrapy
from scrapy_splash import SplashRequest
import urllib.parse as urlparse


class TAFolio(scrapy.Spider):
    name = 'Portfolio'
    start_urls = ['https://www.ta.com/portfolio/investments/ari-network-services-inc']

    def start_requests(self):
        # Render each start URL through Splash, waiting 3s for JavaScript to run
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})

    def parse(self, response):

        # The info-group items appear in a fixed order on the detail page,
        # so they are picked out by position
        companyname = response.css('h1.item_detail-main-info-heading::text').extract_first()
        sectors = response.css('.item_detail-main-info-group-item::text')[0].extract()
        investmentyear = response.css('.item_detail-main-info-group-item::text')[1].extract()
        status = response.css('.item_detail-main-info-group-item::text')[2].extract()
        location = response.css('.item_detail-main-info-group-item::text')[3].extract()
        region = response.css('.item_detail-main-info-group-item::text')[4].extract()
        team = response.css('div.item_detail-main-info-group a::text').extract()

        yield {
            'companyname': companyname,
            'sectors': sectors,
            'investmentyear': investmentyear,
            'status': status,
            'location': location,
            'region': region,
            'team': team
        }

        next_page = response.css('li.item_detail-nav-item--next a::attr(href)').extract()

        if next_page is not None:
            yield SplashRequest(urlparse.urljoin('https://www.ta.com', next_page),
                                callback=self.parse, args={"wait": 3})


This gives me the correct information for the start URL, but it never moves on to the next page.
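(For what it's worth, one thing to check in the snippet above: extract() returns a list, and an empty list is not None, so the if branch always runs and urljoin receives a list rather than a string. A minimal sketch of the next-page handling with extract_first(), which returns a string or None, assuming the selector itself is correct:

        # extract_first() yields a single href string, or None on the last page
        next_page = response.css('li.item_detail-nav-item--next a::attr(href)').extract_first()

        if next_page is not None:
            yield SplashRequest(urlparse.urljoin('https://www.ta.com', next_page),
                                callback=self.parse, args={"wait": 3})
)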

Update: the problem was the order in which I navigated the website. Below is the updated code, which works well.

import scrapy
from scrapy_splash import SplashRequest
import urllib.parse as urlparse


class TAFolio(scrapy.Spider):
    name = 'Portfolio'
    start_urls = [
        'https://www.ta.com/portfolio/business-services',
        'https://www.ta.com/portfolio/consumer',
        'https://www.ta.com/portfolio/financial-services',
        'https://www.ta.com/portfolio/healthcare',
        'https://www.ta.com/portfolio/technology'
    ]

    def start_requests(self):
        # Render each category page through Splash so the JS-built tiles exist
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})

    def parse(self, response):
        # Collect every company tile link on the category page and
        # follow each one to its detail page
        companylink = response.css('div.tiles.js-portfolio-tiles a::attr(href)').extract()
        for i in companylink:
            yield response.follow('https://www.ta.com' + str(i), callback=self.parse1)

    def parse1(self, response):

        # Detail-page fields, picked out by position as in the first spider
        companyname = response.css('h1.item_detail-main-info-heading::text').extract_first()
        sectors = response.css('.item_detail-main-info-group-item::text')[0].extract()
        investmentyear = response.css('.item_detail-main-info-group-item::text')[1].extract()
        status = response.css('.item_detail-main-info-group-item::text')[2].extract()
        location = response.css('.item_detail-main-info-group-item::text')[3].extract()
        region = response.css('.item_detail-main-info-group-item::text')[4].extract()
        team = response.css('div.item_detail-main-info-group a::text').extract()
        about_company = response.css('h2.item_detail-main-content-heading::text').extract()
        about_company_detail = response.css('div.markdown p::text').extract()

        yield {
            'companyname': companyname,
            'sectors': sectors,
            'investmentyear': investmentyear,
            'status': status,
            'location': location,
            'region': region,
            'team': team,
            'about_company': about_company,
            'about_company_detail': about_company_detail
        }
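(Assuming Splash is running locally, for example via Docker, and the scrapy-splash middlewares are configured as sketched earlier, the spider can be started and its items exported with Scrapy's built-in feed export:

# start a local Splash instance (assumes Docker is installed)
docker run -p 8050:8050 scrapinghub/splash

# run the spider and export the scraped items to JSON
scrapy crawl Portfolio -o portfolio.json
)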

Comments:

What are you asking?

I am trying to understand where I went wrong in getting Scrapy to move on to the next page, and to keep crawling until there are no more pages to visit. With this code, scraping stops at the first page itself.

@Rainb I guess you need to formulate your question better; without more context it is unclear why it does not work. Or give a very specific example, otherwise it is hard to help :(

Solved the problem, thanks anyway!

Then post it as an answer.