Python Selenium DOM searcher returns the page body instead of a WebElement


python selenium dom web-scraping


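I am building a crawler for news sites such as TechCrunch and Bloomberg. These sites all follow a similar pattern: article cards are loaded lazily when a "load more" style button is clicked.

I designed it to use multiprocessing so that the loading process and the digest process run in parallel. For context, the run method below lives in a Crawler class that abstracts the site-specific elements, so there is no need to write a scraper for each site. Here is the entry method: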

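from multiprocessing import Pipe, Process

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

def run(self):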
    """ Runs a crawler. """
    binary: FirefoxBinary = FirefoxBinary(firefox_path="/usr/bin/firefox")
    self.driver: Firefox = Firefox(firefox_binary=binary)
    self.driver.get(self.url)

    self.load_pipe, self.digest_pipe = Pipe()

    load_proc: Process = Process(target=self._load_content)
    load_proc.start()

    digest_proc: Process = Process(target=self._digest_content)
    digest_proc.start()

The problem occurs in the loading process, which is implemented by the _load_content method. Specifically, it occurs on the first line, in the call to find_element_by_class_name:

def _load_content(self):
    """ Loads more content. """
    loader: WebElement = self.driver.find_element_by_class_name(self.loader_name)
    ...

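When tested synchronously, without parallelism, the call returns a WebElement representing the target button. When run in parallel, however, it returns a str containing the entire page body, and the code then fails with:

AttributeError: 'str' object has no attribute 'click'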

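I have verified that the driver is still intact inside _load_content, yet the method still returns a str rather than a WebElement. Oddly, if no element with the given class name can be found, it raises NoSuchElementException as expected. So why does it return the HTML body as a str? What am I missing? Does multiprocessing somehow break the driver API?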

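Because of limitations in the browser itself, the WebDriver API is not thread-safe. The browser expects one command at a time, so the load and digest steps have to run synchronously rather than in parallel. Running multiple browser instances would not solve the problem either, even if you had the resources for it, because state is not shared between them.

The failure in find_element_by_class_name is most likely caused by the driver instance's state being corrupted, or by the browser bindings not behaving the way the API expects.

One potential solution is to implement a callback structure between the load and digest steps. Like this (pseudocode):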

while article cards are available
    digest article cards

if no article cards are available
    load more article cards
    start digesting article cards again
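
The pseudocode above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not the asker's actual Crawler code: crawl_sequentially and its three callables (find_cards, digest, load_more) are hypothetical names introduced for illustration, and the max_loads limit is an added safety cap. The point is that a single process alternates between digesting and loading, so the browser only ever receives one command at a time.

```python
from typing import Callable, Sequence


def crawl_sequentially(
    find_cards: Callable[[], Sequence],
    digest: Callable[[object], None],
    load_more: Callable[[], bool],
    max_loads: int = 10,
) -> int:
    """Alternate between digesting and loading in one process.

    All names here are hypothetical placeholders:
      find_cards -- returns every article card currently on the page
      digest     -- processes a single card
      load_more  -- triggers the "load more" button; returns False
                    when nothing more can be loaded
    Returns the number of cards digested.
    """
    digested = 0  # how many cards have been processed so far
    loads = 0     # how many times "load more" has succeeded

    while True:
        cards = find_cards()
        # Digest only the cards that appeared since the last pass.
        for card in cards[digested:]:
            digest(card)
        digested = max(digested, len(cards))

        # No undigested cards remain: load more, or stop if we hit
        # the cap or the page has nothing further to offer.
        if loads >= max_loads or not load_more():
            return digested
        loads += 1
```

With Selenium, find_cards might wrap a call such as driver.find_elements(By.CLASS_NAME, ...), and load_more might click the loader button and return False once NoSuchElementException is raised. That wiring is an assumption about how the Crawler class is structured and is left out of the sketch.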