Python 刮擦式爬行器无法爬行所需页面
这是我试图抓取的网站链接。 下面是我的刮刀,因为这是第一次尝试刮刀,所以请原谅愚蠢的错误。请看一看,并建议任何修改,这将使我的代码运行 Items.pyPython 刮擦式爬行器无法爬行所需页面,python,scrapy,Python,Scrapy,这是我试图抓取的网站链接。 下面是我的刮刀,因为这是第一次尝试刮刀,所以请原谅愚蠢的错误。请看一看,并建议任何修改,这将使我的代码运行 Items.py import scrapy class EpfoCrawl2Item(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() from scrapy.item import Item, Field S
import scrapy
class EpfoCrawl2Item(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
from scrapy.item import Item, Field
S_No = Field()
Old_region_code = Field()
Region_code = Field()
Name = Field()
Address = Field()
Pin = Field()
Epfo_office = Field()
Under_Ro = Field()
Under_Acc = Field()
Payment = Field()
pass
import scrapy
from scrapy.selector import HtmlXPathSelector
class EpfoCrawlSpider(scrapy.Spider):
"""Spider for regularly updated search.epfoservices.in"""
name = "PfData"
allowed_domains = ["search.epfoservices.in"]
starturls = ["http://search.epfoservices.in/est_search_display_result.php?pageNum_search=1&totalRows_search=72045&old_rg_id=AP&office_name=&pincode=&estb_code=&estb_name=&paging=paging"]
def parse(self,response):
hxs = HtmlXPathSelector(response)
rows = hxs.select('//tr"]')
items = []
for val in rows:
item = Val()
item['S_no'] = val.select('/td[0]/text()').extract()
item['Old_region_code'] = val.select('/td[1]/text').extract()
item['Region_code'] = val.select('/td[2]/text()').extract()
item['Name'] = val.select('/td[3]/text()').extract()
item['Address'] = val.select('/td[4]/text()').extract()
item['Pin'] = val.select('/td[5]/text()').extract()
item['Epfo_office'] = val.select('/td[6]/text()').extract()
item['Under_ro'] = val.select('/td[7]/text()').extract()
item['Under_Acc'] = val.select('/td[8]/text()').extract()
item['Payment'] = val.select('a/@href').extract()
items.append(item)
yield items
epfocrawl1\u spider.py
import scrapy
class EpfoCrawl2Item(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
from scrapy.item import Item, Field
S_No = Field()
Old_region_code = Field()
Region_code = Field()
Name = Field()
Address = Field()
Pin = Field()
Epfo_office = Field()
Under_Ro = Field()
Under_Acc = Field()
Payment = Field()
pass
import scrapy
from scrapy.selector import HtmlXPathSelector
class EpfoCrawlSpider(scrapy.Spider):
"""Spider for regularly updated search.epfoservices.in"""
name = "PfData"
allowed_domains = ["search.epfoservices.in"]
starturls = ["http://search.epfoservices.in/est_search_display_result.php?pageNum_search=1&totalRows_search=72045&old_rg_id=AP&office_name=&pincode=&estb_code=&estb_name=&paging=paging"]
def parse(self,response):
hxs = HtmlXPathSelector(response)
rows = hxs.select('//tr"]')
items = []
for val in rows:
item = Val()
item['S_no'] = val.select('/td[0]/text()').extract()
item['Old_region_code'] = val.select('/td[1]/text').extract()
item['Region_code'] = val.select('/td[2]/text()').extract()
item['Name'] = val.select('/td[3]/text()').extract()
item['Address'] = val.select('/td[4]/text()').extract()
item['Pin'] = val.select('/td[5]/text()').extract()
item['Epfo_office'] = val.select('/td[6]/text()').extract()
item['Under_ro'] = val.select('/td[7]/text()').extract()
item['Under_Acc'] = val.select('/td[8]/text()').extract()
item['Payment'] = val.select('a/@href').extract()
items.append(item)
yield items
下面是运行“scrapy crawl PfData”后的日志
请提供建议。开始URL列表必须是
开始URL
,而不是开始URL
@abukaj No!代码审查需要工作代码。此代码无法正常工作。类EpfoCrawlSpider(scrapy.Spider):
之后缺少缩进也永远不会起作用。缩进在Python中很重要。@Mast不是检查源代码来调试它,而是检查代码?@abukaj也许这是检查代码的一个版本,但它是针对codereview.se的。如果您不熟悉SE站点的范围,请不要推荐它们。@Mast谢谢。我只检查了codereview.se的描述,它相当简短,而不是主题外策略。