Web scraping: how do I get all the data from a website?
My code only returns data for 44 links instead of 102. Can someone tell me why it extracts only that many, and how I can extract all of them correctly? I would appreciate your help.
import scrapy


class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()
    Revenue2014 = scrapy.Field()
    Revenue2015 = scrapy.Field()
    Website = scrapy.Field()
    Rank = scrapy.Field()
    Employees = scrapy.Field()
    headquarters = scrapy.Field()
    FoundedYear = scrapy.Field()


class ProjectSpider(scrapy.Spider):
    # ...
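Your XPath has some potential problems. Instead, iterate over the paragraphs within the article and pull the link out of each one:

    # iterate all paragraphs within the article:
    for para in response.xpath("//*[@itemprop='articleBody']/p"):
        url = para.xpath("./a/@href").extract()
        # ... etc

By the way, len(response.xpath("//*[@itemprop='articleBody']/p")) gives the expected 102.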
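To try the per-paragraph approach without hitting the site, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's response.xpath so that it runs without Scrapy installed. The markup, the company names, the example.com URLs, and the SKIP_PHRASES list are all invented for illustration and are not taken from the real page. It counts the paragraphs, takes the link that is a direct child of each paragraph, and skips links whose text looks like a prompt rather than a company name.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the article markup (not the real page).
SAMPLE = """
<html><body>
<div itemprop="articleBody">
  <p><a href="http://example.com/story/company-a/">Company A</a></p>
  <p><a href="http://example.com/story/company-b/">Company B</a></p>
  <p>For the full list, <a href="http://example.com/list/">click or tap here</a></p>
  <p>A paragraph with no link at all.</p>
</div>
</body></html>
"""

# Hypothetical phrases that mark a navigation link rather than a company.
SKIP_PHRASES = ("click", "tap here")


def company_links(markup):
    """Return (paragraph_count, [(name, url), ...]) for company links."""
    root = ET.fromstring(markup.strip())
    # Same shape as the XPath in the answer, in ElementTree's XPath subset.
    paragraphs = root.findall(".//*[@itemprop='articleBody']/p")
    links = []
    for para in paragraphs:
        anchor = para.find("a")  # direct child <a>, like ./a/@href
        if anchor is None:
            continue  # paragraph without a link
        text = (anchor.text or "").strip()
        if any(phrase in text.lower() for phrase in SKIP_PHRASES):
            continue  # navigation link, not a company
        links.append((text, anchor.get("href")))
    return len(paragraphs), links


count, links = company_links(SAMPLE)
print(count)
for name, url in links:
    print(name, url)
```

Comparing the paragraph count with the number of links kept shows how many paragraphs were skipped and why, which helps when the totals do not match what you expect.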
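You may need to filter the URLs to remove non-company links, such as the one labeled "click or tap here".

If you look closely at Scrapy's output, you will see that after a few dozen requests they start getting redirected, like this: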
DEBUG: Redirecting (302) to <GET http://www.cincinnati.com/get-access/?return=http%3A%2F%2Fwww.cincinnati.com%2Fstory%2Fmoney%2F2016%2F11%2F27%2Ffrischs-restaurants%2F94430718%2F> from <GET http://www.cincinnati.com/story/money/2016/11/27/frischs-restaurants/94430718/>
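The page returned for those requests says: We hope you have enjoyed your free access.
So it seems they only offer limited access to anonymous users. You may need to register for their service to get full access to the data.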
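To find out which stories were lost to this redirect, one option is to recognise the access page by its URL. The sketch below relies only on what the log line shows: the path /get-access/ and the return query parameter that carries the originally requested URL. The function name blocked_story_url is hypothetical, and it assumes the site keeps using this URL layout.

```python
from urllib.parse import urlsplit, parse_qs

ACCESS_WALL_PATH = "/get-access/"  # path taken from the log line above


def blocked_story_url(redirect_target):
    """Return the originally requested story URL if redirect_target is the
    access page and carries a "return" parameter, otherwise None."""
    parts = urlsplit(redirect_target)
    if parts.path != ACCESS_WALL_PATH:
        return None
    # parse_qs percent-decodes the value of the "return" parameter.
    values = parse_qs(parts.query).get("return")
    return values[0] if values else None


target = ("http://www.cincinnati.com/get-access/?return=http%3A%2F%2F"
          "www.cincinnati.com%2Fstory%2Fmoney%2F2016%2F11%2F27%2F"
          "frischs-restaurants%2F94430718%2F")
print(blocked_story_url(target))
```

In a spider callback you could call blocked_story_url(response.url) and log or count the result instead of parsing the page, since with Scrapy's default redirect handling response.url is the final URL after the redirect. Detecting the redirect does not get around the limit; it only makes the missing items visible.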