Scraping CNN with Scrapy and Selenium in Python

I want to build a highly automated scraper that can open cnn.com search result pages (which is why I need Selenium), extract some information from each article, and then move on to the next page. So far, however, I have had almost no success.
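Conceptually, the browser-automation half of that is just the sketch below, using the same URL and Next-button XPath as in the full spider further down; whether CNN's search markup actually still matches them is the open question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# open the search results and keep clicking "Next" until the target page
driver = webdriver.Chrome()
try:
    driver.get('https://www.cnn.com/search?q=elizabeth%20warren&size=10&page=1')
    while not driver.current_url.endswith('page=161'):
        next_btn = WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located(
                (By.XPATH, "//div[@class='pagination-bar']/div[contains(text(), 'Next')]")))
        next_btn.click()
    html = driver.page_source  # this is the HTML Scrapy would then parse
finally:
    driver.quit()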

At the moment my code looks like the spider pasted at the bottom (I know it is probably terrible; it is a patchwork of other spiders I found).


What Chrome actually does right now is open the first page and then close it again almost immediately, without doing anything else. Can someone help me put the pieces together?

Unfortunately, none of the suggestions so far have solved my problem. I know my code is probably wrong in more than one place; I just don't know where.
import scrapy
from scrapy.http import TextResponse
from cnn.items import CNNitem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class CNNspider(CrawlSpider):
    name = "cnn_spider"
    allowed_domains = ['cnn.com']
    start_urls = ['https://www.cnn.com/search?q=elizabeth%20warren&size=10&page=1']
    # restrict_xpaths must point at elements, not attributes, so the
    # trailing /@href is dropped; LinkExtractor pulls the href itself
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="cnn-search__results-list"]//h3/a'),
             callback='parse_post', follow=True),
    ]

    def __init__(self, *a, **kw):
        super(CNNspider, self).__init__(*a, **kw)
        self.driver = webdriver.Chrome()

    def parse_page(self, response):
        # Selenium part of the job: load the page so the JavaScript-rendered
        # search results exist at all
        self.driver.get(response.url)
        while True:
            more_btn = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located(
                    (By.XPATH, "//div[@class='pagination-bar']/div[contains(text(), 'Next')]"))
            )
            more_btn.click()

            # stop when we reach the desired page
            if self.driver.current_url.endswith('page=161'):
                break

        # now scrapy should do the job; note that only the final page's HTML
        # is parsed here, the pages clicked through above are skipped
        response = TextResponse(url=self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//div[@class="cnn-search__results-list"]/div[@class="cnn-search__result cnn-search__result--article"]'):
            item = CNNitem()
            item['Title'] = post.xpath('.//h3[@class="cnn-search__result-headline"]/a/text()').extract_first()
            # Request needs one absolute URL, not the list that extract() returns
            link = post.xpath('.//h3[@class="cnn-search__result-headline"]/a/@href').extract_first()
            item['Link'] = response.urljoin(link)

            yield scrapy.Request(item['Link'], meta={'item': item}, callback=self.parse_post)

    def parse_post(self, response):
        # links followed via the rule arrive without an item in meta
        item = response.meta.get('item', CNNitem())
        item["Body"] = response.xpath('//section[@id="body-text"]/div[1]/div/text()').extract()
        return item
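
One likely explanation for Chrome opening the first page and then closing again almost at once: as written, nothing ever calls parse_page. CrawlSpider applies its rules to the raw HTML that Scrapy itself downloads, and that HTML does not yet contain the JavaScript-rendered result links, so the rule extracts nothing and the crawl simply ends. A minimal sketch of the missing wiring, assuming the class otherwise stays exactly as above:

from scrapy.spiders import CrawlSpider

class CNNspider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules, __init__,
    # parse_page and parse_post exactly as above ...

    def parse_start_url(self, response):
        # CrawlSpider invokes this hook for every start URL; routing it
        # into parse_page is what actually starts the Selenium work
        return self.parse_page(response)

    def closed(self, reason):
        # called automatically when the spider shuts down, so the Chrome
        # process is not left running
        self.driver.quit()

Both parse_start_url and the closed() shortcut are standard Scrapy hooks, so no manual signal wiring is needed for either.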