Web scraping: why can't I access the information inside the tbody?
[Here is the source code of the website][1]. I am using BeautifulSoup for web scraping, but I cannot find any `tr` inside the `tbody`, even though the `tr` elements really are there in the page source; the `find_all` function only returns the `tr` elements from the ads. The link I am scraping is below, and here is some of my code:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
tr = soup.find_all("tr")
print(tr)
```
[1]: https://i.stack.imgur.com/NFwEV.png
The data is requested from an API that returns JSON; in other words, it is added dynamically, so it is not present in the response for the landing page. You can find the API endpoint used to fetch the information in the browser's Network tab. You can change one of its parameters to a number larger than the expected result set and then check whether any further requests are needed:
```python
import requests

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
```
Otherwise, you can make an initial call, work out how many additional requests to issue, and adjust the corresponding parameters in the URL; the response includes the page count. With a pageSize of 20, you can see the relevant part here:
```python
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '20',
 'pageCount': 68,
```
Everything you need for a loop that fetches all the results is in there. After changing the parameter to a larger number, you see the following:
```python
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '2000',
 'pageCount': 1,
```
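The `rowCount`, `pageSize`, and `pageCount` fields above are all you need to size the requests. As a minimal sketch (the helper function is hypothetical, not part of the API), the page count the API reports is just the row count divided by the page size, rounded up:

```python
import math

def pages_needed(row_count: int, page_size: int) -> int:
    # Mirrors the 'pageCount' field the API returns for a given pageSize.
    return math.ceil(row_count / page_size)

print(pages_needed(1343, 20))    # 68, matching the pageCount shown above
print(pages_needed(1343, 2000))  # 1, so a single request covers everything
```

So once `pageSize` exceeds `rowCount`, one request is enough and no loop is needed.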
The response can be converted into a table with:
```python
import requests
import pandas as pd

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
df = pd.DataFrame(r['data'])
print(df)
```
A sample of `df`:
An example that checks the actual count and makes a follow-up request for the additional records:
```python
import requests
import pandas as pd

request_number = 1000

with requests.Session() as s:
    r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
    df = pd.DataFrame(r['data'])
    actual_number = r['rowCount']
    if actual_number > request_number:
        # Request the remaining rows on page 2.
        request_number = actual_number - request_number
        r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=2&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
        df2 = pd.DataFrame(r['data'])
        final = pd.concat([df, df2])
    else:
        final = df
```
To get the table content using the selectors you see when inspecting the element, you can try the approach I show below. It is an asynchronous approach using pyppeteer, so I would suggest sticking with the API above unless you cannot find a usable API:
```python
import asyncio
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    # Wait until the report table has actually been rendered.
    await page.waitForSelector("table.js-report-builder-table tr td")
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        tds = [await page.evaluate('e => e.innerText', td) for td in await tr.querySelectorAll("th,td")]
        print(tds)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_table(url))
```
The output looks like this:
```python
['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge']
['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty', 'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', '']
['Abay Hailu', 'Agiere', 'February 9, 1998', 'Ethiopia', 'Killed', 'Dangerous Assignment', '']
['Abd al-Karim al-Ezzo', 'Freelance', 'December 21, 2012', 'Syria', 'Killed', 'Crossfire', '']
['Abdallah Bouhachek', 'Révolution et Travail', 'February 10, 1996', 'Algeria', 'Killed', 'Murder', '']
['Abdel Aziz Mahmoud Hasoun', 'Masar Press', 'September 5, 2013', 'Syria', 'Killed', 'Crossfire', '']
['Abdel Karim al-Oqda', 'Shaam News Network', 'September 19, 2012', 'Syria', 'Killed', 'Murder', '']
```
- Thank you so much! Where did you find the link in `r = requests.get('')`? And what if I want to look up unconfirmed deaths, or media workers instead of journalists?
- I found it in the Network tab. Open the dev tools with F12, refresh the page, and inspect the web traffic; you will see the different URIs served to the page. Have a look, see whether you can find it, and let me know.
- This is a great solution. It does not deserve a downvote.