
Python: Goes to the next page, but doesn't scrape elements, using Selenium and Scrapy

Tags: python, selenium, web-scraping, scrapy

I'm trying to scrape all the pages using Selenium by clicking the "Next" button. However, when I move to the next page, the URL doesn't change. I can move through all the pages, but I only get the items scraped from the first page, and I don't know how to make it work for all pages. Any suggestions on what I should do?

Thank you in advance.

The code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


class MilieuProperties(scrapy.Spider):
    name = 'milieu_properties'
    start_urls = [
        # FOR SALE
        'https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)',
        'https://www.milieuproperties.com/RentalByCategory.aspx'
    ]

    def __init__(self):
        # headless Chrome options
        options = Options()
        options.add_argument('--no-sandbox')
        options.add_argument("--headless")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        self.driver = webdriver.Chrome('path', options=options)  # 'path' is a placeholder for the chromedriver location

    
    def parse(self, response):
        self.driver.get(response.url)
        current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
        # keep clicking "Next" until the button disappears; after each
        # click, wait for the pager's current-page number to change
        while True:
            try: 
                elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next" and not(@class)]')))
                elem.click()
            except TimeoutException:
                break
            WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text != current_page_number)
            current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text


        offering = response.css('span#ContentPlaceHolder1_lblbreadcum::text').get()
        try:
            offering = 'rent' if 'Rental' in offering else 'buy'
        except TypeError:
            offering = 'buy'

        # rebuild the site root (scheme + host) from the request URL
        base_link = response.request.url.split('/')
        try:
            base_link = base_link[0] + '//' + base_link[2] + '/'
        except:
            pass

        for p in response.xpath('//div[@class="ct-itemProducts ct-u-marginBottom30 ct-hover"]'):
            link = base_link + p.css('a::attr(href)').get()

            yield scrapy.Request(
                link,
                callback=self.parse_property,
                meta={'item': {
                    'url': link,
                    'offering': offering,
                    }},
            )


    def parse_property(self, response):
        item = response.meta.get('item')
        . . .
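A note on why only first-page items come back: the while loop above pages through the site in the live headless browser, but the extraction below still runs against response, the single HTML document Scrapy originally downloaded (the comments at the end of this page make the same point). One way to keep the Scrapy-style selectors would be to re-parse the browser's current HTML after each click; a minimal sketch, where extract_links is a hypothetical helper and the XPath is copied from the spider above:

from scrapy import Selector

def extract_links(page_source, base_link):
    # Parse the live browser's HTML with Scrapy selectors instead of
    # the original `response` object.
    page = Selector(text=page_source)
    hrefs = page.xpath(
        '//div[@class="ct-itemProducts ct-u-marginBottom30 ct-hover"]'
    ).css('a::attr(href)').getall()
    return [base_link + href for href in hrefs]

# e.g. inside the while loop:
#     links.extend(extract_links(self.driver.page_source, base_link))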

You can get the data without Scrapy at all. Try the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()

links = []
url = 'https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)'
driver.get(url)
current_page_number = driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
while True:
    # collect the property links from the current page before paging on
    links.extend([link.get_attribute('href') for link in driver.find_elements_by_css_selector('.hoverdetail a')])
    try: 
        elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next" and not(@class)]')))
        elem.click()
    except TimeoutException:
        break
    WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text != current_page_number)
    current_page_number = driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text

print(links)
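To also scrape the detail pages, the collected links can then be opened one by one with the same driver (the comments below end with this same suggestion). A rough continuation of the snippet above; the wait condition is an assumption, and the actual field extraction is left as a placeholder:

for link in links:
    driver.get(link)
    # wait for the detail page to render before reading anything
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body')))
    # ... read the required fields here with driver.find_element_by_* calls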

Comments:

The simplest solution is simply not to use Scrapy together with Selenium. You can get all the data you need using Selenium only.

@JaSON I would only be making the Scrapy requests for the property pages. I don't quite understand why, if you click the "Next" button in Selenium and the URL doesn't change, the required data can't be obtained by requesting the same page HTML with Scrapy.

For that you would need to pass the cookies from Selenium over to Scrapy. But that seems like a redundant operation, since you are already on the required page and can grab the data directly with the Selenium code. Simply put, there is no synchronization between Scrapy and Selenium, so when you go to the next page with Selenium, Scrapy doesn't "know" about it.

@JaSON I'm not sure how I would get all the links to the property pages and continue scraping data from them?

You don't need any links or HTTP requests. Simply clicking the "Next" button loads the new HTML DOM, and you can scrape the required data with Selenium's built-in methods/properties.

Thank you so much! Once I have these links, is there a way to go to those links and scrape them? That's why I was using Scrapy before @JaSON

@saraherceg You just need to iterate through the list of links and open them one by one: for link in links: driver.get(link)
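For reference, the cookie hand-off mentioned in the comments might look roughly like the fragment below, dropped into the spider's parse() after Selenium has paged. It assumes the site keeps its paging/session state in cookies, which is not guaranteed here; that uncertainty is part of why the commenter calls the approach redundant:

# Fragment for inside parse(): copy the browser's cookies onto the
# Scrapy request so both sides share one session (an assumption that
# the paging state is cookie-based).
cookies = {c['name']: c['value'] for c in self.driver.get_cookies()}
yield scrapy.Request(link, cookies=cookies, callback=self.parse_property)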