Python Scrapy: comparing scraped data

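I'm very new to Scrapy and I'm not sure how to proceed with my project. The idea is to scrape the first two pages of Hacker News and print every article/title with a score above 300. With my limited knowledge, the code below is the best way I could work out to get the information I need. My end goal is to compare id with post_id to match them up, add the points to the corresponding match, and then filter out anything with fewer than 300 points. I don't know how to compare the dictionary values I've scraped. The code is as follows: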

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = [
        'https://news.ycombinator.com',
        # 'https://news.ycombinator.com/news?p=2'
    ]

    def parse(self, response):
        link = response.css('tr.athing')
        score = response.css('td.subtext')
        for website in link:
            yield {
                'title': website.css('tr.athing td.title a.storylink::text').get(),
                'link':  website.css('tr.athing td.title a::attr(href)').get(),
                'id': website.css('tr::attr(id)').get(),
            }
        for points in score:
            yield {
                'post_id': points.css('span::attr(id)').get(),
                'points': points.css('span.score::text').get()
            }

Is there a better way to achieve what I'm trying to do?

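The posts and scores lists have the same length and the same order, so they can be walked together by index.
In each iteration, check whether the score of the corresponding post is >= 300.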

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = [
        'https://news.ycombinator.com',
        #'https://news.ycombinator.com/news?p=2'
    ]

    def parse(self, response):
        posts = response.css('tr.athing')
        scores = response.css('td.subtext')
        
        for i in range(len(posts)):
            # get the post row
            post = posts[i]

            # get the subtext cell (which holds the score) of the corresponding post
            score = scores[i]
            score_point = score.css('span.score::text').get()
            # some posts have no score; treat those as 0
            score_point = int(score_point.split(' ')[0]) if score_point else 0
            
            if score_point >= 300:
                yield {
                    'title': post.css('tr.athing td.title a.storylink::text').get(),
                    'link':  post.css('tr.athing td.title a::attr(href)').get(),
                    'id': post.css('tr::attr(id)').get(),
                    'points': score_point
                }

Posts with a score >= 300 are printed:
{'title': 'One man's fight for the right to repair broken MacBooks', 'link': 'https://columbianewsservice.com/2021/05/21/one-mans-fight-for-the-right-to-repair-broken-macbooks/', 'id': '27254719', 'points': 1138}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Why I prefer making useless stuff', 'link': 'https://web.eecs.utk.edu/~azh/blog/makinguselessstuff.html', 'id': '27256867', 'points': 604}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Why Decentralised Applications Don't Work', 'link': 'https://ingrids.space/posts/why-distributed-systems-dont-work/', 'id': '27259321', 'points': 320}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Freesound just reached 500K Creative Commons sounds', 'link': 'https://blog.freesound.org/?p=1340', 'id': '27232297', 'points': 696}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'The Limits to Blockchain Scalability', 'link': 'https://vitalik.ca/general/2021/05/23/scaling.html', 'id': '27257641', 'points': 378}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Teardown of a PC Power Supply', 'link': 'https://www.righto.com/2021/05/teardown-of-pc-power-supply.html', 'id': '27256515', 'points': 351}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Dorodango: the Japanese art of making shiny mud balls (2019)', 'link': 'https://www.laurenceking.com/blog/2019/09/26/dorodango-blog/', 'id': '27255755', 'points': 313}
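
If you would rather match by id, as the question originally describes, instead of relying on the two lists being in the same order, the join itself is plain Python and does not need Scrapy at all. Below is a minimal sketch under a few assumptions: the helper names parse_points and join_by_id are made up for illustration, the span id is assumed to look like 'score_<post id>', and the two low-scoring sample entries are hypothetical. The ids and points of the first two posts are taken from the output above.

```python
def parse_points(text):
    """Turn a score label such as '320 points' into an int; None or '' becomes 0."""
    return int(text.split()[0]) if text else 0


def join_by_id(posts, scores, min_points=300):
    """Attach points to posts by id and keep posts with at least min_points."""
    # Assumption: the scraped span id has the form 'score_<post id>',
    # so everything after the first underscore is the post id.
    points_by_id = {
        s['post_id'].split('_', 1)[-1]: parse_points(s['points'])
        for s in scores
        if s.get('post_id')  # rows without a score span have no id to match on
    }

    matched = []
    for post in posts:
        points = points_by_id.get(post['id'], 0)  # unmatched posts count as 0
        if points >= min_points:
            matched.append({**post, 'points': points})
    return matched


# Sample data shaped like the two kinds of dicts the question's spider yields.
posts = [
    {'title': "One man's fight for the right to repair broken MacBooks", 'id': '27254719'},
    {'title': "Why Decentralised Applications Don't Work", 'id': '27259321'},
    {'title': 'Hypothetical low-scoring post', 'id': '1'},
    {'title': 'Hypothetical post with no score', 'id': '2'},
]
scores = [  # deliberately in a different order than posts
    {'post_id': 'score_27259321', 'points': '320 points'},
    {'post_id': 'score_27254719', 'points': '1138 points'},
    {'post_id': 'score_1', 'points': '12 points'},
    {'post_id': None, 'points': None},
]

result = join_by_id(posts, scores)
for item in result:
    print(item)
```

Inside a spider you would have to build both lists in parse first and yield only the joined items, because anything yielded from parse is handed on immediately and cannot be merged afterwards. For a single page where the rows line up one-to-one, the index-based loop above is simpler.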
