Web scraping: IMPORTXML shows an error when scraping data from a website
I am trying to scrape the list of 100 universities from this website using:
=IMPORTXML("https://www.topuniversities.com/university-rankings/usa-rankings/2021", "//*[@id='ranking-data-load']/div[1]/div/div/div/div/div[2]")
It shows the error: Imported content is empty.
How can I get the required data using XPath? I found this XHR request in the developer tools:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157
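XPath will not work here unless the page's JavaScript is rendered. IMPORTXML only reads the HTML that the server returns, and the ranking data is loaded afterwards through the XHR request shown in the question.
To get the data, you have two options:
- Selenium driving a real browser such as Chrome or Firefox (requires a webdriver)
- Collect the appropriate headers and data, then send the request yourself with the requests module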
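To illustrate why the XPath comes back empty, here is a minimal sketch using only Python's standard library. Both HTML snippets are hypothetical stand-ins, not the site's real markup: one imitates what a server might return before any script runs, the other what the page could look like once the XHR data has been injected. The path is a simplified version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML as delivered by the server: the container is still empty.
STATIC_HTML = """
<html><body>
  <div id="ranking-data-load"></div>
  <script src="rankings.js"></script>
</body></html>
"""

# Hypothetical HTML after JavaScript has filled the container with rows.
RENDERED_HTML = """
<html><body>
  <div id="ranking-data-load">
    <div class="row">Harvard University</div>
    <div class="row">Stanford University</div>
  </div>
</body></html>
"""

# Simplified path: direct <div> children of the ranking container.
XPATH = ".//*[@id='ranking-data-load']/div"


def xpath_matches(markup, path):
    """Parse well-formed markup and return the elements matching path."""
    root = ET.fromstring(markup)
    return root.findall(path)


print(len(xpath_matches(STATIC_HTML, XPATH)))    # 0: nothing to match yet
print(len(xpath_matches(RENDERED_HTML, XPATH)))  # 2: rows exist after rendering
```

A tool that only sees the first document, as IMPORTXML does, has nothing to select, no matter how the XPath is written.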
Code for the second option, using requests:
import re
import requests

# XHR endpoint seen in the browser's developer tools
URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157'

# Headers copied from the browser's XHR request
headers = {
    "Host": "www.topuniversities.com",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.topuniversities.com/university-rankings/usa-rankings/2021",
    "X-Requested-With": "XMLHttpRequest",
    "via": "1.1 google"
}

# The response is JSON; each entry in 'data' holds an HTML snippet in 'title'
datas = requests.get(URL, headers=headers).json()

for i in datas['data']:
    # Non-greedy match so that only the link text is captured
    for j in re.findall(r'class="uni-link">(.*?)</a>', i['title']):
        print(j)
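Sample output from the code above:
Harvard University
Stanford University
Massachusetts Institute of Technology (MIT)
University of California, Berkeley (UCB)
University of California, Los Angeles (UCLA)
Yale University

Comments:
@rene Can you tell me how you found the developer tools and this URL?
@vish I did not write the answer; I only edited it to make it a bit clearer. I don't know how this user arrived at their answer.
@Sheshanandh Could you tell me how you found the XHR request in the developer tools, and this URL?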
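
As a variation on the regular expression in the answer's code, the link text can also be pulled out with Python's built-in HTML parser, which tolerates attribute reordering and decodes entities such as &amp;amp;. This is only a sketch: the SAMPLE payload is hypothetical, shaped after what the answer's code implies (a 'data' list whose 'title' values contain an <a class="uni-link"> element), and the names UniLinkParser and extract_names are made up for illustration.

```python
from html.parser import HTMLParser


class UniLinkParser(HTMLParser):
    """Collect the text of every <a> element whose class list has 'uni-link'."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "uni-link" in classes:
            self._in_link = True
            self.names.append("")

    def handle_data(self, data):
        if self._in_link:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


def extract_names(payload):
    """Return the university names found in payload['data'][*]['title']."""
    names = []
    for row in payload["data"]:
        parser = UniLinkParser()
        parser.feed(row["title"])
        parser.close()
        names.extend(name.strip() for name in parser.names)
    return names


# Hypothetical payload; the real response may carry more fields and markup.
SAMPLE = {
    "data": [
        {"title": '<div><a href="/u/harvard" class="uni-link">Harvard University</a></div>'},
        {"title": '<div><a class="uni-link" href="/u/stanford">Stanford University</a></div>'},
    ]
}

print(extract_names(SAMPLE))
```

With the live data, extract_names(datas) would take the place of the nested loop, assuming the response really has this shape.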