Web scraping JavaScript pages with Python

python, web-scraping, python-2.x, urlopen

I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. In fact, I achieve this goal, but I have seen that on some pages where JavaScript is loaded I don't obtain good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

response = urllib2.urlopen(request)
I get the original text without the added one (because the JavaScript is executed on the client side).


So, I'm looking for some way to solve this problem.

It sounds like the data you really want can be accessed via a secondary URL called by some JavaScript on the primary page.


While you could try running the JavaScript on the server to handle this, a simpler approach might be to load the page in Firefox and use a tool that monitors network traffic to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
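A minimal sketch of that idea in the question's urllib2 style, assuming the network inspector revealed a hypothetical JSON endpoint (the URL below is illustrative):

import json
import urllib2

# hypothetical secondary URL discovered by watching the page's network traffic
secondary_url = "http://example.com/api/items?page=1"

response = urllib2.urlopen(secondary_url)
data = json.loads(response.read())  # the content the JavaScript would have added, as raw data
print(data)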

Edit 30/Dec/2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape is not maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found that using Selenium's Python library with PhantomJS as a web driver is fast enough and easy to get the work done.

Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path:

phantomjs --version
# result:
2.1.1
Example: To give an example, I created a sample page whose only content is a paragraph with id "intro-text"; a short inline script replaces its text when the page loads.

Scraping with JS support:
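A minimal sketch of the Selenium + PhantomJS approach described above (my_url stands for the sample page; intro-text is the element the inline script rewrites):

from selenium import webdriver

# PhantomJS must be installed and on the PATH (see `phantomjs --version` above)
driver = webdriver.PhantomJS()
driver.get(my_url)
# the element is looked up after the page's JavaScript has run
p_element = driver.find_element_by_id('intro-text')
print(p_element.text)
# Result:
# Yay! Supports javascript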
(Old answer) You can also use the Python library dryscrape to scrape javascript-driven websites.

Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
# Yay! Supports javascript


This also seems like a good solution.

Maybe you can do it like this:

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # give the page's JavaScript time to run
htmlSource = driver.page_source  # now contains the JS-rendered HTML

You can also execute JavaScript using the webdriver:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('document.title')
Or store the value in a variable:

result = driver.execute_script('var text = document.title ; return text')

You will need to use urllib, requests, beautifulSoup and the selenium webdriver in your script for different parts of the page (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of these modules.
Sometimes you'll need to switch off the JS in your browser.
Sometimes you'll need header info in your script.
No website can be scraped the same way, and no website can be scraped the same way forever without needing to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way.
If you need the scraped data continuously into the future, just scrape everything you need once and store it in .dat files with pickle.
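A minimal sketch of that pickle idea (the file name and data below are just illustrative):

import pickle

# hypothetical example: scraped_items is whatever your scraper collected
scraped_items = {"title": "Example", "paragraphs": ["first", "second"]}

# store everything in a .dat file for later runs
with open("scraped.dat", "wb") as f:
    pickle.dump(scraped_items, f)

# ...and load it back later without re-scraping
with open("scraped.dat", "rb") as f:
    cached_items = pickle.load(f)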

Just keep researching how to use these modules, and copy and paste your errors into Google.

Selenium is best for scraping JS and Ajax content.

Check this article.

Then download the Chrome webdriver.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get("https://www.python.org/")

nav = browser.find_element_by_id("mainnav")

print(nav.text)

Easy, right?

If you have ever used the Requests module for Python before, I recently found out that the developer created a new module called Requests-HTML which now also has the ability to render JavaScript.

You can also read the module's documentation to learn more about it, or, if you're only interested in rendering JavaScript, look at the part that covers rendering JavaScript with Python directly.

Basically, once you have correctly installed the Requests-HTML module, the following example shows how you can use it to scrape a website and render the JavaScript contained in it:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('http://python-requests.org/')

r.html.render()

r.html.search('Python 2 will retire in only {months} months!')['months']

'<time>25</time>' #This is the result.

I recently learned about this from a YouTube video which demonstrates how the module works.

A mix of BeautifulSoup and Selenium works very well for me:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # waits up to 10 seconds until the element is located; other wait conditions
    # such as visibility_of_element_located or text_to_be_present_in_element also work
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))
    html = driver.page_source
    soup = bs(html, "lxml")
    dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
except TimeoutException:
    print("Couldn't locate element")

Also, you can find more wait conditions in the Selenium documentation.

Personally, I prefer using Scrapy and Selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:

Create your scraper with scrapy startproject and write your spider; the skeleton can be as simple as this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])


    def parse(self, response):

        # do stuff with results, scrape items etc.
        # now we're just checking everything worked

        print(response.body)
The real magic happens in the middleware. Override two methods in the downloader middleware, __init__ and process_request, in the following way:

# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()

        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):

        self.driver.get(request.url)

        # sleep a bit so the page has time to load
        # or monitor items on page to continue as soon as page ready
        sleep(4)

        # if you need to manipulate the page content like clicking and scrolling, you do it here
        # self.driver.find_element_by_css_selector('.my-class').click()

        # you only need the now properly and completely rendered html from your page to get results
        body = deepcopy(self.driver.page_source)

        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
Don't forget to enable this middleware by uncommenting the following lines in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
Next comes dockerization. Create your Dockerfile from a lightweight image (I'm using python-alpine here), copy your project directory into it, and install the requirements:

# Use an official Python runtime as a parent image
FROM python:3.6-alpine

# install some packages necessary to scrapy and then curl because it's  handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

WORKDIR /my_scraper

ADD requirements.txt /my_scraper/

RUN pip install -r requirements.txt

ADD . /my_scraper
Finally, bring it all together in docker-compose.yaml:

version: '2'
services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null
Run docker-compose up -d. The first time you do this it will take a while to fetch the latest selenium/standalone-chrome image and build your scraper image.

Once it's done, you can check that your containers are running with docker ps, and also check that the name of the selenium container matches the environment variable we passed to our scraper container (here it was SELENIUM_LOCATION=samplecrawler_selenium_1).

Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh; for me the command was docker exec -ti samplecrawler_my_scraper_1 sh. cd into the right directory and run your scraper with scrapy crawl my_spider.

The whole thing is on my github.
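Another option is to pair Scrapy with Splash via the scrapy-splash package: enable its middlewares and settings in settings.py, point SPLASH_URL at a running Splash instance, and yield SplashRequest from your spider so each page is rendered before parsing. The settings and spider below show that setup: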
DOWNLOADER_MIDDLEWARES = {
      'scrapy_splash.SplashCookiesMiddleware': 723,
      'scrapy_splash.SplashMiddleware': 725,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
import scrapy
from scrapy_splash import SplashRequest
# QuoteItem is assumed to be an Item defined in the project's items.py

class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url, callback=self.parse, endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote
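Outside of Scrapy, requests-html alone can also do the rendering for a quick one-off (a_page_url is whatever page you want to fetch):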
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)
r.html.render()
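Another route is to render the page with PyQt5's QtWebEngine and hand the finished HTML to BeautifulSoup; the commented lines at the end show how the Client class is used: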
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
import sys
import bs4 as bs
import urllib.request


class Client(QWebEnginePage):
    def __init__(self,url):
        global app
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ""
        self.loadFinished.connect(self.on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print("Load Finished")

    def Callable(self,data):
        self.html = data
        self.app.quit()

# url = ""
# client_response = Client(url)
# print(client_response.html)
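requests-html can also execute a custom JavaScript snippet while rendering and return its result to Python: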
from requests_html import HTMLSession
session = HTMLSession()
response = session.request(method="get", url="https://www.google.com/")
response.html.render()
script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    } 
"""
>>> response.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
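The same mechanism can pull out data that the page's own scripts expose on the window object; the rendering script simply returns it: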
return {
    data: window.view.data
}
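Pyppeteer, a Python port of the Puppeteer browser-automation library, drives headless Chromium and lets you evaluate JavaScript expressions against the fully rendered page: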
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"headless": True})
    [page] = await browser.pages()

    # normally, you go to a live site...
    #await page.goto("http://www.example.com")
    # but for this example, just set the HTML directly:
    await page.setContent("""
    <body>
    <script>
    // inject content dynamically with JS, not part of the static HTML!
    document.body.innerHTML = `<p>hello world</p>`; 
    </script>
    </body>
    """)
    print(await page.content()) # shows that the `<p>` was inserted

    # evaluate a JS expression in browser context and scrape the data
    expr = "document.querySelector('p').textContent"
    print(await page.evaluate(expr, force_expr=True)) # => hello world

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
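Finally, sending a browser-like User-Agent header with plain requests sometimes helps when a server returns stripped-down pages to clients it doesn't recognize as browsers; note that this does not execute any JavaScript: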
import requests
custom_User_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
url = "https://www.abc.xyz/your/url"
response = requests.get(url, headers={"User-Agent": custom_User_agent})
html_text = response.text