
Javascript: Saving a dynamically loaded web page


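This should be a simple task, but I can't manage it because I know nothing about web architecture (not even the basics).

I want to visit https://www.coursera.org/browse/arts-and-humanities/history with some filters applied (for example, Language = English).

After this page loads, many courses are not displayed until I scroll down. If I save the HTML file locally, I can find only 58 occurrences of https://www.coursera.org/learn/, the URL prefix of a course, but I expect at least 128.

So how can I save a dynamically loaded web page, using either Chrome or Python?

Using @Rajat's code, the emulated browser does scroll down to the bottom, but the HTML obtained is still incomplete: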

import os
from bs4 import BeautifulSoup
import time
from selenium import webdriver

current_dir=os.getcwd()
# download chromedriver for your operating system
driver = webdriver.Chrome(current_dir+'/chromedriver')
#place your url here
url="https://www.coursera.org/browse/arts-and-humanities/history?facets=skillNameMultiTag%2CjobTitleMultiTag%2CdifficultyLevelTag%2Clanguages%3AEnglish%2CentityTypeTag%2CpartnerMultiTag%2CcategoryMultiTag%2CsubcategoryMultiTag%3Ahistory&sortField="
driver.get(url)
count = 1200
step = 30
for _ in range(count):
    driver.execute_script("window.scrollBy(0, {});".format(step))
    time.sleep(0.01)

with open("output.html", "w") as file:
    file.write(driver.page_source)


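You should use the Selenium web driver. I used chromedriver for this: it opens your web page and performs the scrolling, and you only need to decide on a condition for how far to scroll.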

import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

current_dir = os.getcwd()
# download chromedriver for your operating system
# (passing the driver path positionally is the older Selenium 3 style)
driver = webdriver.Chrome(current_dir + '/chromedriver')
# place your url here
url = "https://stackoverflow.com"
driver.get(url)
# scroll as many times as you need; count is the number of scroll steps
count = 10
for i in range(1, count + 1):
    driver.execute_script("window.scrollTo(0, {});".format(i * 1400))
    time.sleep(2)  # wait for newly loaded content

inner_html = driver.page_source
soup = BeautifulSoup(inner_html, 'html.parser')

Here soup will contain all the HTML of this web page.
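
The answer leaves the stopping condition open. One common choice, sketched here as an assumption and not as the answerer's own method, is to keep scrolling to the bottom until document.body.scrollHeight stops growing. The helper name scroll_until_stable and its parameters pause and max_rounds are hypothetical; driver can be any object with Selenium's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `driver` only needs an
    execute_script(script) method, as Selenium WebDriver objects have.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds
```

Calling scroll_until_stable(driver) after driver.get(url) and then reading driver.page_source would replace the fixed scroll count. As the question reports, scrolling alone may still not yield the complete HTML for this particular page.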

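It looks like they are fetching the results with GraphQL. There also doesn't seem to be any authentication on the site. You can get the results with a simple POST call using any tool you like (Python, curl, Postman, etc.). Since your original code is written in Python, here is a simple snippet in Python: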

#!/usr/bin/env python

import requests
import json
import warnings
warnings.filterwarnings("ignore")

def getHeadersb345e918473d():
    result={}
    result['content-type']='application/json'
    return result

def json_data_e6084285():
    result=[]
    result_item0={}
    result_item0['query']='query catalogResultQuery($facets: [String!]!, $start: String!, $skip: Boolean = false, $sortField: String, $limit: Int) { CatalogResultsV2Resource { browseV2(facets: $facets, start: $start, limit: $limit, sortField: $sortField) @skip(if: $skip) { elements { label entries { id score courseId specializationId onDemandSpecializationId resourceName __typename } domainId subdomainId facets courses { elements { ...CourseFragment __typename } __typename } s12ns { elements { ...S12nFragment __typename } __typename } __typename } paging { total next __typename } __typename } __typename } } fragment CourseFragment on CoursesV1 { id slug name photoUrl s12nIds level workload courseDerivativesV2 { skillTags { skillName relevanceScore __typename } avgLearningHoursAdjusted commentCount averageFiveStarRating ratingCount __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } __typename } fragment S12nFragment on OnDemandSpecializationsV1 { name id slug logo courseIds derivativeV2 { averageFiveStarRating avgLearningHoursAdjusted __typename } partners { elements { name squareLogo classLogo logo __typename } __typename } metadata { headerImage level __typename } courses { elements { courseDerivativesV2 { skillTags { skillName relevanceScore __typename } __typename } __typename } __typename } __typename } '
    variables={}
    variables['skip']=False
    facets=[]
    facets.append('skillNameMultiTag')
    facets.append('jobTitleMultiTag')
    facets.append('difficultyLevelTag')
    facets.append('languages:English')
    facets.append('entityTypeTag')
    facets.append('partnerMultiTag')
    facets.append('categoryMultiTag')
    facets.append('subcategoryMultiTag:history')
    variables['facets']=facets
    variables['limit']=300
    variables['start']='0'
    variables['sortField']=''
    result_item0['variables']=variables
    result_item0['operationName']='catalogResultQuery'
    result.append(result_item0)
    return result

url='https://www.coursera.org/graphqlBatch'
r=requests.post(url, headers=getHeadersb345e918473d(), data=json.dumps(json_data_e6084285()), verify=False )
print(r.text)

You can modify the limit and start values to get the results you need.

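I used @Gautam's code and only restructured it.

The first request returns only 100 items (even with limit=300), so I use start to get the next 28 items.

By using json= instead of data=, I don't need headers= or json.dumps().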
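
A minimal sketch of that approach, not the answerer's original code. It assumes the response layout mirrors the field names in the query from the previous answer (CatalogResultsV2Resource, browseV2, elements, courses, paging.total); that layout is inferred, not confirmed. The function names build_payload, fetch_page and fetch_all are hypothetical, session is expected to be a requests.Session, and query is the full query string from the previous answer.

```python
def build_payload(query, facets, start, limit):
    """Build the batch body: a list holding one GraphQL operation."""
    return [{
        "operationName": "catalogResultQuery",
        "query": query,
        "variables": {
            "skip": False,
            "facets": facets,
            "limit": limit,
            "start": str(start),  # the earlier snippet sends start as a string
            "sortField": "",
        },
    }]


def fetch_page(session, url, query, facets, start, limit):
    """POST one page; json= serialises the body and sets the content type."""
    response = session.post(url, json=build_payload(query, facets, start, limit))
    response.raise_for_status()
    # ASSUMPTION: layout inferred from the query's field names.
    browse = response.json()[0]["data"]["CatalogResultsV2Resource"]["browseV2"]
    elements = browse.get("elements") or []
    courses = elements[0]["courses"]["elements"] if elements else []
    return courses, browse["paging"]["total"]


def fetch_all(session, url, query, facets, limit=300):
    """Advance start by the number of items received until all are fetched."""
    courses = []
    while True:
        page, total = fetch_page(session, url, query, facets, len(courses), limit)
        courses.extend(page)
        if not page or len(courses) >= total:
            break
    return courses
```

With session = requests.Session() and url = 'https://www.coursera.org/graphqlBatch', fetch_all would issue one request per page, which matches the two batches (100 and 28 items) described here if the server caps each page at 100.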

Result:

>>> len: 100
Buddhism and Modern Psychology 
English Composition I
Fashion as Design
The Modern World, Part One: Global History from 1760 to 1910
Indigenous Canada
Understanding Einstein: The Special Theory of Relativity
Terrorism and Counterterrorism: Comparing Theory and Practice
Magic in the Middle Ages
The Ancient Greeks
Introduction to Ancient Egypt and Its Civilization
>>> len: 28
Theatre and Globalization
ART of the MOOC: Arte Público y Pedagogía 
The Music of the Rolling Stones, 1962-1974
Soul Beliefs: Causes and Consequences - Unit 2: Belief Systems
The Making of the US President: A Short History in Five Elections
Cities are back in town : sociologie urbaine pour un monde globalisé
Toledo: Deciphering Secrets of Medieval Spain
Russia and Nuclear Arms Control
Espace mondial, a French vision of Global studies
Religious Transformation in Early China: the Period of Division
Patrick Henry: Forgotten Founder
A la recherche du Grand Paris
Burgos: Deciphering Secrets of Medieval Spain
Journey Conversations: Weaving Knowledge and Action
Structuring Values in Modern China
Religion and Thought in Modern China: the Song, Jin, and Yuan
宇宙之旅:展现生命 (Journey of the Universe: The Unfolding of Life)
The Worldview of Thomas Berry:  The Flourishing of the Earth Community
Science and Technology in the Silla Cultural Heritage
世界空间、法国视角下的国际研究
Fundamentals of the Chinese character writing (Part 1)
Understanding China, 1700-2000: A Data Analytic Approach, Part 2
"Espace mondial" الرؤية الفرنسية للدراسات العالمية
Searching for the Grand Paris
宇宙之旅:对话 (Journey of the Universe: Weaving Knowledge and Action)
Contemporary India 
Thomas Berry的世界观:地球社区的繁荣 (The Worldview of Thomas Berry: The Flourishing of the Earth Community)
"Making" Progress Teach-Out

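Since most dynamic content is usually processed and populated by JS at runtime, working with web applications can get tricky.

@varunagarwal Could you post an answer with a detailed explanation?

With Python and Selenium you can control a web browser (Chrome/Firefox) and scroll the page so that the browser loads it. Then you can get the HTML (using Selenium) and save it to a file with the standard open(), write(), close(). But this saves only the HTML, not images, JS, CSS, etc.

@furas Yes, I did exactly what you said, but the HTML obtained is still as incomplete as before.

First: you should have included this code from the start. Second: you don't need BeautifulSoup to write the HTML, just write(driver.page_source). Third: I would have to run the code to see the problem.

Hi Rajat, thanks for your answer, but it doesn't work for me. Could you check my update?

Yes, I will check the update. As far as I can tell, your web page may require a login check, or it is built with JavaScript. If a login is required, you can log in with the send_keys method. Example: first identify the input field by inspecting the element, then: email = driver.find_element_by_id('loginid'); email.clear(); email.send_keys('xyz@gmail.com')

No login is required.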
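
The save-and-check step discussed in these comments can be sketched as follows. This is an illustrative helper, not code from the thread: the names save_html and count_course_links are hypothetical, and the slug pattern (letters, digits, hyphens, underscores) is an assumption about how course URLs look. The prefix is the one quoted in the question.

```python
import re

COURSE_PREFIX = "https://www.coursera.org/learn/"  # prefix quoted in the question


def count_course_links(html):
    """Return the number of distinct course slugs linked from the HTML."""
    # ASSUMPTION: a slug consists of letters, digits, hyphens and underscores.
    pattern = re.escape(COURSE_PREFIX) + r"([A-Za-z0-9_-]+)"
    return len(set(re.findall(pattern, html)))


def save_html(html, path):
    """Write the page source as UTF-8 so non-ASCII course titles survive."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
```

After scrolling, save_html(driver.page_source, "output.html") stores the page, and count_course_links(driver.page_source) gives a quick check against the expected number of courses.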