Scrapy (Python): recursively finding href references


I am trying to find and print every href starting from the start page:

class Ejercicio2(scrapy.Spider):
    name = "Ejercicio2"
    Ejercicio2 = {}
    category = None
    lista_urls = []  # define a list to collect the urls

    def __init__(self, *args, **kwargs):
        super(Ejercicio2, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.masterdatascience.es/']
        self.allowed_domains = ['www.masterdatascience.es/']
        url = ['http://www.masterdatascience.es/']

    def parse(self, response):
        print(response)
        # hay_enlace = response.css('a::attr(href)')
        # if hay_enlace:
        links = response.xpath("a/@href")
        for el in links:
            url = response.css('a::attr(href)').extract()
            print(url)
            next_url = response.urljoin(el.xpath("a/@href").extract_first())
            print(next_url)
            print('pasa por aqui')
            yield scrapy.Request(url, self.parse())
            # yield scrapy.Request(next_url, callback=self.parse)
            print(next_url)

But it does not work as expected: it does not follow the href references it encounters, only the first one.

You could try changing the xpath to //a/@href
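To see why this matters, here is a minimal standalone sketch (the HTML snippet is made up for illustration; it assumes scrapy is installed) showing that the relative expression a/@href matches nothing when evaluated at the document root, while //a/@href searches the whole document:

    from scrapy.selector import Selector

    # a tiny made-up page with two links
    html = ('<html><body>'
            '<p><a href="/page1">one</a></p>'
            '<p><a href="/page2">two</a></p>'
            '</body></html>')
    sel = Selector(text=html)

    # relative path: looks for <a> as a direct child of the root context
    print(sel.xpath('a/@href').extract())    # []
    # absolute // path: searches the entire document
    print(sel.xpath('//a/@href').extract())  # ['/page1', '/page2']

This is the same situation the spider above is in: response.xpath("a/@href") is evaluated at the root of the document, where no <a> elements live, so the loop has almost nothing to iterate over.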

The following code will print every href on the page:

import scrapy

class stackoverflow20170129Spider(scrapy.Spider):
    name = "stackoverflow20170129"
    allowed_domains = ["masterdatascience.es"]
    start_urls = ["http://www.masterdatascience.es/",]

    def parse(self, response):
        for href in response.xpath('//a/@href'):
            url = response.urljoin(href.extract())
            print(url)
#           yield scrapy.Request(url, callback=self.parse_dir_contents)

One more thing: it is worth dropping the www. from allowed_domains. If you crawl deeper into the site and start visiting pages such as anewpage.masterdatascience.es, keeping the www. would cause those pages to be filtered out.

Could you also try removing the trailing / from allowed_domains? self.allowed_domains = ['www.masterdatascience.es']
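Putting the answer and the comments together, here is a minimal sketch of a spider that follows every link recursively. The class name Ejercicio2Fixed is hypothetical; it assumes scrapy is installed and combines the three fixes suggested above (//a/@href, a cleaned-up allowed_domains, and passing the callback instead of calling it):

    import scrapy

    class Ejercicio2Fixed(scrapy.Spider):
        name = "Ejercicio2Fixed"
        # no www. prefix and no trailing slash, per the comments above
        allowed_domains = ["masterdatascience.es"]
        start_urls = ["http://www.masterdatascience.es/"]

        def parse(self, response):
            # //a/@href searches the whole document, not just the root node
            for href in response.xpath("//a/@href").extract():
                url = response.urljoin(href)
                print(url)
                # pass the method itself as the callback; self.parse() would
                # call it immediately instead of scheduling a new request
                yield scrapy.Request(url, callback=self.parse)

Note that yield scrapy.Request(url, callback=self.parse), not self.parse(), is what lets Scrapy schedule each discovered page and call parse again on its response, which is what makes the crawl recursive.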