Python: click a button, then get the content of the new page that opens in a new window


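The problem I am facing is that when I click a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking a link whose target is _blank). In scrapy/splash, I don't know how to get the content from the new page (that is, I don't know how to control the new page).

Can anyone help?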

from scrapy_splash import SplashRequest

# The following lives inside the spider class.
script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        local element = splash:select('div.result-content-columns div.result-title')
        local bounds = element:bounds()
        -- click the centre of the element (coordinates are relative to it)
        element:mouse_click{x=bounds.width/2, y=bounds.height/2}
        return splash:html()
    end
"""

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})

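Problem: You can't scrape HTML that lies outside the scope of your selection. When a new link is clicked and an iframe is involved, the new content is rarely brought into scraping scope.

Solution: Pick a method for selecting the new iframe, then go on to parse the new HTML.

The Scrapy-Splash method (adapted from Mikhail Korobov's solution)

If you are able to get the src link of the new page that pops up, that is probably the most reliable approach. However, you can also try selecting the iframe like this: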

import parsel
from scrapy_splash import SplashRequest

# ...
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # With iframes=1, render.json lists each child frame under 'childFrames'
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }
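
The snippet above only reads the first child frame. If the content you want sits in a different or nested frame, something like the following sketch may help you find it. It assumes each frame entry in the render.json output carries 'html', 'url' and 'childFrames' keys; the helper name iter_frames and the sample dict are made up for illustration and stand in for response.data.

```python
def iter_frames(frame, depth=0):
    """Yield (depth, url, html) for a frame dict and every nested child frame."""
    yield depth, frame.get("url"), frame.get("html", "")
    for child in frame.get("childFrames", []):
        yield from iter_frames(child, depth + 1)


# Hypothetical, hand-written stand-in for response.data from render.json
sample = {
    "url": "http://example.com/",
    "html": "<html><body><iframe src='/inner'></iframe></body></html>",
    "childFrames": [
        {
            "url": "http://example.com/inner",
            "html": "<html><body><table></table></body></html>",
            "childFrames": [],
        }
    ],
}

frames = list(iter_frames(sample))
for depth, url, html in frames:
    # Indent by nesting depth so the frame tree is easy to read
    print("  " * depth + str(url), len(html))
```

Once you know which frame holds the data, you can index into it directly, or request its url on its own, which ties in with the point about using the src link.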

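The Selenium method (requires pip install selenium, bs4, and possibly downloading the Chrome driver for your OS from here: ). It supports JavaScript parsing! Woohoo.

Use the following code to switch scope to the new frame: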

# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

# Your path depends on where you downloaded/located your chromedriver.exe
# Raw string so the backslashes are not treated as escape sequences
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless") # Speeds things up if you don't need gui
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)

chrome_options.binary_location = CHROME_PATH

# Selenium 4 style; on Selenium 3 use
# webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=chrome_options)
browser = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=chrome_options)

url = "http://example_js_site.com" # Your site goes here (include the scheme)
browser.get(url)
time.sleep(3) # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0) # Switch scope to the first iframe on the page

soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return every '<table>' element found in the frame
table = soup.find_all('table')
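
The code above switches into an iframe. If the click really opens a separate window or tab, as in the question, a Selenium driver exposes the open windows through window_handles and can be pointed at one with switch_to.window. Below is a minimal sketch of that idea; the helper name switch_to_new_window and its timeout and poll parameters are assumptions for illustration, not part of Selenium.

```python
import time


def switch_to_new_window(driver, old_handles, timeout=10.0, poll=0.2):
    """Switch driver to the first window handle that is not in old_handles.

    driver is expected to behave like a Selenium WebDriver, i.e. to offer
    driver.window_handles and driver.switch_to.window(handle).
    """
    deadline = time.monotonic() + timeout
    while True:
        new_handles = [h for h in driver.window_handles if h not in old_handles]
        if new_handles:
            driver.switch_to.window(new_handles[0])
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no new window appeared within %.1f s" % timeout)
        time.sleep(poll)  # give the browser time to open the window


# Possible usage with the browser object from above (button is hypothetical):
# handles_before = set(browser.window_handles)
# button.click()
# switch_to_new_window(browser, handles_before)
# html = browser.page_source  # now the source of the new window
```

Recording the handles before the click and polling afterwards avoids guessing which handle belongs to the new window, and it replaces a fixed sleep with a bounded wait.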


Of these two options my favorite is Selenium, but if you are more comfortable with the first solution, give it a try.

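I have added the code, please take a look.

I'd like to know which library browser comes from. Does scrapy-splash support it?

@SangHuynh Selenium

It works well, thank you very much.

@SangHuynh If this worked for you, could you accept it as the answer?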