Python: following crawled links in Scrapy
Tags: python, web-scraping, scrapy, web-crawler
import scrapy
from final.items import FinalItem

class ScrapeMovies(scrapy.Spider):
    name = 'final'
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            request = scrapy.Request(website,
                                     callback=self.parse_page2)
            yield request

    def parse_page2(self, response):
        request.meta['item'] = item
        item['travelog'] = response.xpath('string(//div[@class="statistics-btm"]/ul//li[position()=4]/a)').extract_first()
        yield item

        # next_page = response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
I want to scrape the name and other information for each user in this table, then follow the link to each user's profile, collect some data from those profiles, and merge everything into a single item. After that I want to return to the main table and move on to the next page; the last part of the code is responsible for this and is commented out for now.

The code I wrote does not work correctly. The error I get is:
TypeError: Request url must be str or unicode, got NoneType:
How can I fix this and scrape all the data correctly?

Answer: You need code like the following, because your XPath expressions were wrong:
def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        profile_url = row.xpath('./td[2]//a/@href').extract_first()
        yield scrapy.Request(
            url=response.urljoin(profile_url),
            callback=self.parse_profile,
            meta={"item": item},
        )

    next_page_url = response.xpath('//div[@class="page-nav-btm"]//li[last()]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

def parse_profile(self, response):
    item = response.meta['item']
    item['travelog'] = response.xpath('//div[@class="statistics-btm"]/ul//li[ ./span[contains(., "Travelogues")] ]/a/text()').extract_first()
    yield item
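For context on the original TypeError: the broken XPath `./td[2]//a/@href/text()` asks for a text child of an attribute node, which never exists, so `extract_first()` returns None and `scrapy.Request(None)` fails. The fixed version selects `@href` directly and absolutizes it with `response.urljoin()`, which resolves relative links the same way the standard library's `urljoin` does. A quick sketch of that joining behavior (the relative paths are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

# response.urljoin() resolves a relative href against the page URL,
# much like urllib.parse.urljoin does for plain strings.
base = 'https://www.trekearth.com/members/page1.htm?sort_by=md'

# A relative profile link is resolved against the base URL:
print(urljoin(base, '/members/someuser/'))
# -> https://www.trekearth.com/members/someuser/

# An already-absolute link passes through unchanged:
print(urljoin(base, 'https://www.trekearth.com/members/other/'))
# -> https://www.trekearth.com/members/other/
```

This is why the fixed code can yield profile and next-page requests without caring whether the site emits relative or absolute hrefs.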
//div[@class=page nav btm]/ul/li[last]/a/@href为什么有last方法?你能检查一下吗?也许这不是一个解决方案,但无论如何,请检查它是否正确,因为在xpath中,指向下一页的箭头始终是此placeNameError中的最后一个箭头:未定义名称“item”-这是我收到的无数次。。。奇怪,因为在items.py中,所有内容都是正确的defined@mrowkacala哦,是复制/粘贴错误。请检查更新版本
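On the `last()` question above: in XPath, `li[last()]` selects the `li` that is the last child of its parent, which is why it targets the trailing next-page arrow. A minimal standard-library sketch of that predicate (the HTML fragment below is made up to mimic the pagination block, not copied from the site):

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the pagination list: the next-page
# arrow is the last <li> inside the <ul>.
html = """
<div>
  <ul>
    <li><a href="page1.htm">1</a></li>
    <li><a href="page2.htm">2</a></li>
    <li><a href="page2.htm">&gt;</a></li>
  </ul>
</div>
"""
root = ET.fromstring(html)

# [last()] keeps only the final <li>, i.e. the "next" arrow.
last_link = root.find('.//li[last()]/a')
print(last_link.get('href'))  # page2.htm
```

If the site ever rendered the pagination differently (say, no arrow on the final page), the `if next_page_url:` guard in the answer is what stops the spider from yielding a request with a missing URL.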