Python scraping URL error

Tags: python, python-2.7, scrapy

Hi, I'm new to Python and Scrapy. I'm trying to write a spider, but when it processes the links extracted by the rules I can't pin down where the error comes from or how to fix it.

I don't know whether the problem is encoding, relative paths, or something else.

When I run the script, the start URL yields 94 item links to follow, and as you can see I get 94 'spider_exceptions/ValueError' entries.

Error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no
NULL bytes or control characters 2017-07-17 16:12:49
[scrapy.core.scraper] ERROR: Spider error processing <GET
https://subastas.boe.es/detalleSubasta.php?idSub=SUB-JA-2017-68197&idBus=_VDFMQktMNXdpU0loK3B1UjZhMzhzUHdTUmdiTW9DNjBhM3lkMWpZWDBGbXdtOEVmWW13VmlhSC8vQUR5V1RNRjY0NWhVcjd2aDRMbkVyMkFLbmN4Ym0wc1E4eHVHWHlxSURJSTVBeGhzNGFIRzNkOUpBbW9SRG5RZExsbUNNeFFORSs1R21vaEJIeVhrMkdKdGRYUzg5N1laT2NPUTBwYUI0SVlHTm8vRkF4UEpleHE0b2U2MmZTdFhvZlIyUzgyemg0ekhOSEVoWEtuaVFMbXdBei92MytWaXNhWGtUTVd4SDJZUk9KUUJpVnExa01TeUhOcGZFQ1JqZDIxVU9BTWpHMGJVRU9rNmljVVN4UFFkNUp4SG1FR3dYWGlrVGgxWVJnWkRIQVJXZWxadVRpYWRUcm81WUgxeW4xb3RxQWJXV3JSNUl1N0NYZFoyVlhDaldGWU5RPT0,>
(referer:
https://subastas.boe.es/subastas_ava.php?campo%5B0%5D=SUBASTA.ORIGEN&dato%5B0%5D=&campo%5B1%5D=SUBASTA.ESTADO&dato%5B1%5D=EJ&campo%5B2%5D=BIEN.TIPO&dato%5B2%5D=I&dato%5B3%5D=501&campo%5B4%5D=BIEN.DIRECCION&dato%5B4%5D=&campo%5B5%5D=BIEN.CODPOSTAL&dato%5B5%5D=&campo%5B6%5D=BIEN.LOCALIDAD&dato%5B6%5D=&campo%5B7%5D=BIEN.COD_PROVINCIA&dato%5B7%5D=28&campo%5B8%5D=SUBASTA.POSTURA_MINIMA_MINIMA_LOTES&dato%5B8%5D=&campo%5B9%5D=SUBASTA.NUM_CUENTA_EXPEDIENTE_1&dato%5B9%5D=&campo%5B10%5D=SUBASTA.NUM_CUENTA_EXPEDIENTE_2&dato%5B10%5D=&campo%5B11%5D=SUBASTA.NUM_CUENTA_EXPEDIENTE_3&dato%5B11%5D=&campo%5B12%5D=SUBASTA.NUM_CUENTA_EXPEDIENTE_4&dato%5B12%5D=&campo%5B13%5D=SUBASTA.NUM_CUENTA_EXPEDIENTE_5&dato%5B13%5D=&campo%5B14%5D=SUBASTA.ID_SUBASTA_BUSCAR&dato%5B14%5D=&campo%5B15%5D=SUBASTA.FECHA_FIN_YMD&dato%5B15%5D%5B0%5D=&dato%5B15%5D%5B1%5D=&campo%5B16%5D=SUBASTA.FECHA_INICIO_YMD&dato%5B16%5D%5B0%5D=&dato%5B16%5D%5B1%5D=&page_hits=1000&sort_field%5B0%5D=SUBASTA.FECHA_FIN_YMD&sort_order%5B0%5D=desc&sort_field%5B1%5D=SUBASTA.FECHA_FIN_YMD&sort_order%5B1%5D=asc&sort_field%5B2%5D=SUBASTA.HORA_FIN&sort_order%5B2%5D=asc&accion=Buscar)
Code:

Spyder.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from boe.items import boeItem

class boeSpider(CrawlSpider):
    name = 'boe'
    item_count = 0
    allowed_domains = ['subastas.boe.es']  # domains only, no scheme
    start_urls = ['https://subastas.boe.es/subastas_ava.php?campo[0]=SUBASTA.ORIGEN&dato[0]=&campo[1]=SUBASTA.ESTADO&dato[1]=EJ&campo[2]=BIEN.TIPO&dato[2]=I&dato[3]=501&campo[4]=BIEN.DIRECCION&dato[4]=&campo[5]=BIEN.CODPOSTAL&dato[5]=&campo[6]=BIEN.LOCALIDAD&dato[6]=&campo[7]=BIEN.COD_PROVINCIA&dato[7]=28&campo[8]=SUBASTA.POSTURA_MINIMA_MINIMA_LOTES&dato[8]=&campo[9]=SUBASTA.NUM_CUENTA_EXPEDIENTE_1&dato[9]=&campo[10]=SUBASTA.NUM_CUENTA_EXPEDIENTE_2&dato[10]=&campo[11]=SUBASTA.NUM_CUENTA_EXPEDIENTE_3&dato[11]=&campo[12]=SUBASTA.NUM_CUENTA_EXPEDIENTE_4&dato[12]=&campo[13]=SUBASTA.NUM_CUENTA_EXPEDIENTE_5&dato[13]=&campo[14]=SUBASTA.ID_SUBASTA_BUSCAR&dato[14]=&campo[15]=SUBASTA.FECHA_FIN_YMD&dato[15][0]=&dato[15][1]=&campo[16]=SUBASTA.FECHA_INICIO_YMD&dato[16][0]=&dato[16][1]=&page_hits=1000&sort_field[0]=SUBASTA.FECHA_FIN_YMD&sort_order[0]=desc&sort_field[1]=SUBASTA.FECHA_FIN_YMD&sort_order[1]=asc&sort_field[2]=SUBASTA.HORA_FIN&sort_order[2]=asc&accion=Buscar']

    rules = (
        # One rule: follow each search-result link to its detail page
        Rule(LinkExtractor(restrict_xpaths="//a[contains(@class,'resultado-busqueda-link-defecto')]"),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        DATAQ = boeItem()
        # "General" info fields of the auction detail page
        DATAQ['Gen_Id'] = response.xpath('//th[text()="Identificador"]/following-sibling::td[1]/strong/text()').extract_first()
        DATAQ['Gen_Tipo'] = response.xpath('//th[text()="Tipo de subasta"]/following-sibling::td[1]/strong/text()').extract()
        DATAQ['Gen_Inicio'] = response.xpath('//th[text()="Fecha de inicio"]/following-sibling::td[1]/span/text()').extract()
        DATAQ['Gen_Fin'] = response.xpath('//th[text()="Fecha de conclusión"]/following-sibling::td[1]/span/text()').extract()
        DATAQ['Gen_Deuda'] = response.xpath('//th[text()="Cantidad reclamada"]/following-sibling::td[1]/text()').extract()
        DATAQ['Gen_Lotes'] = response.xpath('//th[text()="Lotes"]/following-sibling::td[1]/text()').extract()
        DATAQ['Gen_Anuncio'] = response.xpath('//th[text()="Anuncio BOE"]/following-sibling::td[1]/a/@href').extract()
        DATAQ['Gen_Valor'] = response.xpath('//th[text()="Valor subasta"]/following-sibling::td[1]/text()').extract()
        DATAQ['Gen_Tasacion'] = response.xpath('//th[text()="Tasación"]/following-sibling::td[1]/text()').extract()
        DATAQ['Gen_Minimo'] = response.xpath('//th[text()="Puja mínima"]/following-sibling::td[1]/text()').extract_first()
        DATAQ['Gen_Tramos'] = response.xpath('//th[text()="Tramos entre pujas"]/following-sibling::td[1]/text()').extract_first()
        DATAQ['Gen_Deposito'] = response.xpath('//th[text()="Importe del depósito"]/following-sibling::td[1]/text()').extract()

        self.item_count += 1
        if self.item_count > 10:
            raise CloseSpider('item_exceeded')
        yield DATAQ
Settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for boe project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'boe'

SPIDER_MODULES = ['boe.spiders']
NEWSPIDER_MODULE = 'boe.spiders'

# CSV import pipeline
ITEM_PIPELINES = {'boe.pipelines.boePipeline': 500, }


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'boe (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Thanks in advance.

I'd suggest you share the full traceback of the error you're seeing. My suspicion is that it comes from using non-ASCII strings in your XPath calls without an explicit u prefix (on Python 2). So, for example,

'//th[text()="Tasación"]/following-sibling::td[1]/text()'

should be

response.xpath(u'//th[text()="Tasación"]/following-sibling::td[1]/text()').extract()
Thanks for the comment, paul. I'm migrating to Python 3.6 now, since I need to work with lots of unicode strings in the scraper and figured that version handles them better. I hope to share a full traceback tomorrow if the error persists. Update: I switched to 3.6 and the script works like a charm. Thank you very much for the tip.
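For completeness, a quick sketch (again mine, not from the thread) of why moving to 3.6 makes the prefixes unnecessary: on Python 3 every plain string literal is already unicode, so the original XPath expressions pass straight through to lxml:

# Python 3: str literals are unicode by default, so the original
# un-prefixed XPath from the spider works unchanged.
from lxml import etree

tree = etree.fromstring('<table><th>Tasación</th></table>')
print(tree.xpath('//th[text()="Tasación"]/text()'))  # ['Tasación']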