Web scraping: Why can't I access the information in the tbody?


[Here is the source code of the website][1]. I am scraping the page with BeautifulSoup, but I can't find any tr elements inside the tbody. The tr elements are actually there in the website's source code; however, find_all only returns the tr elements in the thead.

The link I am scraping (the `url` in the code below):

Here is some of my code:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
print(type(soup))
tr = soup.find_all("tr")
print(tr)
```


  [1]: https://i.stack.imgur.com/NFwEV.png

The data is requested from an API that returns JSON; in other words, it is added dynamically, so it does not appear in the response for the landing page. You can find the API endpoint used to fetch the information in the browser's Network tab.
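
As a quick sanity check (a minimal sketch using only `requests` and `BeautifulSoup`), you can confirm that the static HTML of the landing page contains no data rows at all:

```
import requests
from bs4 import BeautifulSoup

# Fetch the landing page without executing any JavaScript.
url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# The rows are injected client-side, so the static markup has none.
print(len(soup.select("table.js-report-builder-table tbody tr")))  # 0
```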

You can change one of the parameters to a number larger than the expected result set and then check whether any further requests are needed:

```
import requests

# pageSize=2000 comfortably exceeds the expected number of rows,
# so everything comes back in a single response.
r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
```
Alternatively, you can make one initial call, determine from the response how many further requests to issue, and adjust the corresponding parameters in the url. The returned page count tells you this.

You can see the relevant part of the response for a pageSize of 20 here:

```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '20',
 'pageCount': 68,
```
All the information needed for a loop to retrieve every result is there.
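
A minimal sketch of such a loop, assuming the `pageCount` key shown above and varying only the `pageNum` parameter of the same endpoint:

```
import requests
import pandas as pd

# Endpoint template with pageNum left as a placeholder; pageSize stays at 20.
base = ('https://cpj.org/api/datamanager/reports/entries?distinct(personId)'
        '&includes=organizations,fullName,location,status,typeOfDeath,charges,'
        'startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName'
        '&pageNum={}&pageSize=20&in(status,%27Killed%27)'
        '&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))'
        '&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)')

first = requests.get(base.format(1)).json()
frames = [pd.DataFrame(first['data'])]

# pageCount says how many pages there are in total.
for page in range(2, first['pageCount'] + 1):
    frames.append(pd.DataFrame(requests.get(base.format(page)).json()['data']))

all_rows = pd.concat(frames, ignore_index=True)
print(len(all_rows))  # should equal rowCount (1343 at the time of writing)
```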

After changing pageSize to a larger number, you can see the following:

```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '2000',
 'pageCount': 1,
```
This can be converted into a table with the following:

```
import requests
import pandas as pd

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
df = pd.DataFrame(r['data'])
print(df)
```

A sample of df:
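
To reproduce such a sample locally, the usual pandas calls work (the CSV filename below is only an example):

```
print(df.head())                          # first few rows of the frame
df.to_csv('cpj_killed.csv', index=False)  # example filename; save for reuse
```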


An example of checking the actual count and requesting any additional records:

```
import requests
import pandas as pd

request_number = 1000

with requests.Session() as s:
    r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
    df = pd.DataFrame(r['data'])
    actual_number = r['rowCount']
    if actual_number > request_number:
        # Request page 2 with the same pageSize; changing the pageSize here
        # would shift the page boundaries and duplicate or skip rows.
        r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=2&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
        df2 = pd.DataFrame(r['data'])
        final = pd.concat([df, df2])
    else:
        final = df
```
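
As a quick follow-up check (reusing `final` and `actual_number` from the block above), you can verify that the requests together captured every record:

```
# Both names come from the previous block.
assert len(final) == actual_number, f"expected {actual_number} rows, got {len(final)}"
print(final.shape)
```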

If you want to grab the table content using the selectors you see when inspecting the elements, you can try the approach below. Note that it is an asynchronous one (it drives a real browser via pyppeteer), so I would suggest sticking with the API unless you cannot use it:

```
import asyncio
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    # Wait until the JavaScript-rendered rows are actually in the DOM.
    await page.waitForSelector("table.js-report-builder-table tr td")
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        tds = [await page.evaluate('e => e.innerText', td) for td in await tr.querySelectorAll("th,td")]
        print(tds)
    await browser.close()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_table(url))
```
The output looks like the following:

```
['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge']
['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty', 'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', '']
['Abay Hailu', 'Agiere', 'February 9, 1998', 'Ethiopia', 'Killed', 'Dangerous Assignment', '']
['Abd al-Karim al-Ezzo', 'Freelance', 'December 21, 2012', 'Syria', 'Killed', 'Crossfire', '']
['Abdallah Bouhachek', 'Révolution et Travail', 'February 10, 1996', 'Algeria', 'Killed', 'Murder', '']
['Abdel Aziz Mahmoud Hasoun', 'Masar Press', 'September 5, 2013', 'Syria', 'Killed', 'Crossfire', '']
['Abdel Karim al-Oqda', 'Shaam News Network', 'September 19, 2012', 'Syria', 'Killed', 'Murder', '']
```

Thank you so much! Where did you find the link in `r = requests.get('')`? And what if I want to look up unconfirmed deaths, or media workers instead of journalists?

I found it in the Network tab. Press F12 to open the dev tools, refresh the page, and inspect the web traffic; you will see the different URIs served to the page. Have a look, see if you can find it, and let me know.

This is a great solution. It doesn't deserve the downvote.
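
For the follow-up question about other filters: they live in the RQL-style operators of the query string. A purely illustrative variation (not verified against the API; 'Unconfirmed' is an assumed value for motiveConfirmed):

```
import requests

# Hypothetical filter changes, relative to the URL used above:
#   in(motiveConfirmed,'Confirmed') -> in(motiveConfirmed,'Unconfirmed')  (assumed value)
#   in(type,'Journalist')           -> eq(type,'media worker')
url = ('https://cpj.org/api/datamanager/reports/entries?distinct(personId)'
       '&includes=organizations,fullName,location,status,typeOfDeath,charges,'
       'startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName'
       '&pageNum=1&pageSize=2000&in(status,%27Killed%27)'
       '&in(motiveConfirmed,%27Unconfirmed%27)'
       '&eq(type,%22media%20worker%22)&ge(year,1992)&le(year,2019)')
print(requests.get(url).json()['rowCount'])
```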