Python scraping: scraping multiple items from 2 levels
I'm fairly new to Scrapy and I'm working on this as a personal exercise. What I want to do is scrape the movies on the IMDB Top Rated chart to get each movie's rank, title, year, and plot. I've managed to follow the links and scrape the individual movie pages, but I can't find a way to get the rank for each movie. Currently my code looks like this:
import scrapy
from tutorial.items import IMDB_dict  # We need this so that Python knows about the item object

class MppaddressesSpider(scrapy.Spider):
    name = "mppaddresses"  # The name of this spider

    # The allowed domain and the URLs where the spider should start crawling:
    allowed_domains = ["imdb.com"]
    start_urls = ['https://www.imdb.com/chart/top/']

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.
        for rank in response.xpath(".//tbody[@class='lister-list']/tr/td[@class='titleColumn']/text()").extract():
            rank = " ".join(rank.split())
            item = IMDB_dict()
            item['rank'] = rank
        for url in response.xpath(".//tbody[@class='lister-list']/tr/td[@class='titleColumn']/a/@href").extract():
            # This loops through all the URLs found inside the 'titleColumn' cells.
            # Construct an absolute URL by combining the response's URL with a possible relative URL:
            full_url = response.urljoin(url)
            print("Found URL: " + full_url)
            # The following tells Scrapy to scrape the URL in the 'full_url' variable
            # and call the 'get_details()' method below with the content of this URL:
            yield scrapy.Request(full_url, callback=self.get_details)

    def get_details(self, response):
        # This method is called by the 'parse' method above. It scrapes the URLs
        # that have been extracted in the previous step.
        # Store scraped data into a new item:
        item = IMDB_dict()
        item['name'] = response.xpath(".//div[@class='title_bar_wrapper']/div[@class='titleBar']/div[@class='title_wrapper']/h1/text()").extract_first().strip("\t\r\n '\"")
        item['phone'] = response.xpath(".//div[@class='titleBar']/div[@class='title_wrapper']/h1/span[@id='titleYear']/a/text()").extract_first().strip("\t\r\n '\"")
        item['email'] = response.xpath(".//div[@class='plot_summary ']/div[@class='summary_text']/text()").extract_first().strip("\t\r\n '\"")
        # Return that item to the main spider method:
        yield item
Also, my items.py has:
import scrapy

class IMDB_dict(scrapy.Item):
    # Define the fields for your item here, like:
    rank = scrapy.Field()
    name = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()
Main question: how do I get the rank associated with each title?
Bonus question (if possible): I can access URLs when they are relative (using urljoin), but I haven't found a way to handle URLs that are already absolute.
Thanks a lot for your help.
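(On the bonus question: urljoin is safe to call on absolute URLs too. Scrapy's response.urljoin delegates to the standard library's urllib.parse.urljoin, which returns an already-absolute second argument unchanged, so the same code path handles both cases. The URLs below are illustrative examples:)

```python
from urllib.parse import urljoin

base = "https://www.imdb.com/chart/top/"

# A relative href is resolved against the base page:
print(urljoin(base, "/title/tt0111161/"))  # → "https://www.imdb.com/title/tt0111161/"

# An already-absolute href passes through unchanged:
print(urljoin(base, "https://example.com/title/tt0111161/"))  # → "https://example.com/title/tt0111161/"
```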
You need to use meta to send the rank to your get_details callback:
def parse(self, response):
    for movie in response.xpath(".//tbody[@class='lister-list']/tr/td[@class='titleColumn']"):
        movie_rank = movie.xpath('./text()').re_first(r'(\d+)')
        movie_url = movie.xpath('./a/@href').extract_first()
        movie_full_url = response.urljoin(movie_url)
        print("Found URL: " + movie_url)
        yield scrapy.Request(movie_full_url, callback=self.get_details, meta={"rank": movie_rank})

def get_details(self, response):
    item = IMDB_dict()
    item['rank'] = response.meta["rank"]
    item['name'] = response.xpath(".//div[@class='title_bar_wrapper']/div[@class='titleBar']/div[@class='title_wrapper']/h1/text()").extract_first().strip("\t\r\n '\"")
    item['phone'] = response.xpath(".//div[@class='titleBar']/div[@class='title_wrapper']/h1/span[@id='titleYear']/a/text()").extract_first().strip("\t\r\n '\"")
    item['email'] = response.xpath(".//div[@class='plot_summary ']/div[@class='summary_text']/text()").extract_first().strip("\t\r\n '\"")
    # Return that item to the main spider method:
    yield item
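The rank extraction above relies on Selector.re_first(r'(\d+)'), which returns the first regex group found in the selected text, or None when nothing matches. A minimal stand-alone sketch of that behavior using the standard re module (first_int is a hypothetical helper, not part of the answer):

```python
import re

def first_int(text):
    # Mimics Selector.re_first(r'(\d+)'): returns the first run of
    # digits as a string, or None when the text contains no digits.
    match = re.search(r'(\d+)', text)
    return match.group(1) if match else None

# The text node of each titleColumn cell looks roughly like "\n      1.\n      ",
# so the rank survives the surrounding whitespace and the trailing dot:
print(first_int("\n      1.\n      "))   # → "1"
print(first_int("no digits here"))       # → None
```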
UPDATE
If you check the logs, you'll see this error:

AttributeError: 'NoneType' object has no attribute 'strip'

Sometimes .extract_first() returns None, and you can't call strip() on it. I suggest you use Item Loaders instead.

Thank you very much for your help. It works well, and it actually answers both of my questions! However, I can't figure out why it doesn't crawl some of the movies. I've tried running the crawler several times, and it misses the same movies each time. The HTML code seems to be exactly the same. Numbers 22, 34, 44, 65 and some others are not crawled. Do you know why?

Hi Gangabass, thanks for the update. I'm looking into Item Loaders; there is a lot to learn. FYI, I tried to update your answer, but it seems I'm too new to the site to do that. Again, thanks a lot for your comprehensive contribution.

Update: I actually found the reason why numbers 22, 34, etc. were not crawled. For some movies, IMDB changed the class of the summary text and added a new class, so the XPath no longer matched and the output was an empty string. To fix it, I had to change the XPath for the summary to "contains class" instead of "has class". The new XPath:

item['email'] = response.xpath(".//div[@class='plot_summary_wrapper']/div[contains(@class, 'plot_summary')]/div[@class='summary_text']/text()").extract_first().strip("\t\r\n '\"")
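Short of switching to full Item Loaders, one lightweight way to avoid the NoneType error above is to guard the strip() call. This is a minimal sketch under my own naming (clean_text is a hypothetical helper, not from the original thread):

```python
def clean_text(value, chars="\t\r\n '\""):
    # extract_first() returns None when the XPath matches nothing;
    # only call strip() when we actually got a string back.
    return value.strip(chars) if value is not None else None

# Usage inside get_details() would look like, e.g.:
# item['email'] = clean_text(response.xpath("...").extract_first())

print(clean_text("\n      Two imprisoned men bond...\n"))  # → "Two imprisoned men bond..."
print(clean_text(None))                                    # → None
```

This way a missing field yields None in the item instead of crashing the callback, and the miss shows up in the scraped output rather than as a stack trace in the logs.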