Python Scrapy: follow all links and get the status
I want to follow all the links of a website and get the status of each link, like 404 or 200. I tried this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class someSpider(CrawlSpider):
    name = 'linkscrawl'
    item = []
    allowed_domains = ['mysite.com']
    start_urls = ['//mysite.com/']

    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = response.url
        print(item)
I can see the links on the console, but without status codes, like:
mysite.com/navbar.html
mysite.com/home
mysite.com/aboutus.html
mysite.com/services1.html
mysite.com/services3.html
mysite.com/services5.html
But how do I save the status of all the links to a text file?

I solved it as shown below. Hope this helps anyone who needs it:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class LinkscrawlItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()

class someSpider(CrawlSpider):
    name = 'linkscrawl'
    item = []
    allowed_domains = ['mysite.com']
    start_urls = ['//www.mysite.com/']

    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        filename = 'links.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
When I copy your code, I get "No module named 'scrapy.contrib'". Which version are you using? This link might help you:
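For reference, the scrapy.contrib package was deprecated in Scrapy 1.0 and later removed; the same classes now live at scrapy.spiders (CrawlSpider, Rule) and scrapy.linkextractors (LinkExtractor). The url-and-status logging itself does not need Scrapy at all. Here is a minimal standard-library sketch of the same idea; the status_line and check_links names are my own, not from the answer above:

```python
# Stdlib sketch of the "url:status" logging that parse_obj does.
# status_line / check_links are hypothetical helper names.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def status_line(url, status):
    # Same line format the spider appends to links.txt: "<url>:<status>"
    return str(url) + ":" + str(status)

def check_links(urls, filename="links.txt"):
    # Fetch each URL and append its status code to the file.
    with open(filename, "a") as f:
        for url in urls:
            try:
                status = urlopen(url, timeout=10).getcode()
            except HTTPError as e:
                status = e.code         # 404, 500, ... still carry a code
            except URLError:
                status = "unreachable"  # DNS or connection failure
            f.write(status_line(url, status) + "\n")
```

Unlike a Scrapy spider, this does not discover links by itself; you would pass it the list of URLs to check.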