
Python: different number of URLs returned on each run


I have built a crawl spider restricted to a fixed domain that extracts URLs matching a fixed regular expression. When the crawler sees a specific kind of URL, it follows that link. It extracts URLs perfectly, but every time I run it, it returns a different number of links — the count changes from run to run. I am using Scrapy to crawl. Is this a problem with Scrapy? The code is:

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),
    )

    def parse_item(self, response):
        # Append every matched URL to a text file
        with open('urllist.txt', 'a') as outputfile:
            print response.url
            outputfile.write(response.url + '\n')
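As a quick sanity check on the two extraction patterns in the rules above, you can exercise them with Python's `re` module (a sketch; the sample URLs below are invented for illustration, not taken from the actual site):

```python
import re

# The two patterns from the spider's rules
item_pattern = re.compile(r'\/V-\d{7}\/[\w\S]+')
page_pattern = re.compile(r'\?page\=\d+\&sortCriteria\=1')

# Hypothetical URLs in the shapes the rules are meant to match
print(bool(item_pattern.search('http://www.xyz.nl/V-1234567/some-vacancy')))            # True
print(bool(page_pattern.search('http://www.xyz.nl/Vacancies?page=2&sortCriteria=1')))   # True
print(bool(item_pattern.search('http://www.xyz.nl/Vacancies')))                         # False
```

If both patterns behave as expected here, varying link counts are more likely caused by the crawl itself (ordering, duplicates, the site's pagination) than by the regexes.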

Instead of writing the links out by hand in the `parse_item()` method with a file opened in `a` (append) mode, use Scrapy's built-in functionality. Define an item with a link field:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item
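With items yielded instead of written manually, Scrapy's feed exports can collect the results for you. For example (a command sketch, assuming the spider lives inside a Scrapy project):

```shell
# Export every yielded item to a JSON file; use a .csv or .jl
# extension instead to get CSV or JSON-lines output
scrapy crawl xyz -o urls.json
```

This also makes it easy to diff the exported URL lists between runs and see exactly which links change.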

As usual: please post your code? @sshashank124, done, added it!