Python: How can I use Beautiful Soup to get information from all the sub-URLs under a given URL?


python python-3.x web-scraping


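My use case is trying to get all the emails from the sub-URLs under a parent URL such as:

I know that emails generally have the form xyz@xyz.com, so locating the emails on a single URL is easy enough. But when it comes to doing this for all of the sub-URLs, I'm a bit lost.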
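
For the single-page case mentioned above, a minimal sketch could look like the following. The regular expression is a deliberately simple approximation of the xyz@xyz.com shape, not a full address validator, and `extract_emails` and the sample HTML are hypothetical names made up for illustration.

```python
import re

# Simplified pattern for the xyz@xyz.com shape; an assumption, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Hypothetical page content standing in for a downloaded HTML document.
sample_html = (
    '<p>Contact: <a href="mailto:club@example.com">club@example.com</a> '
    'or board@lists.example.edu</p>'
)
found = extract_emails(sample_html)
print(found)
```

The same function could be applied to the text of each sub-page once its URL is known; finding those URLs is the part the question is really about.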

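There's no point in using BeautifulSoup here, because you can get the data directly from the API. First, you need to know how many organizations there are so that you can use that number in the query. Then, by grabbing the 'WebsiteKey' or the organization id, you can iterate over the API to pull the emails. You can store them in a dictionary, a table, print them out, etc. I'm not sure what output you really want.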

import requests
import pandas as pd

# Search endpoint that the site itself calls (visible in the browser's Network tab)
url = 'https://blueprint.uchicago.edu/api/discovery/search/organizations'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

# First request: only used to read the total number of organizations
payload = {
    'orderBy[0]': 'UpperName asc',
    'top': '',
    'filter': '',
    'query': '',
    'skip': '0'}
data = requests.get(url, headers=headers, params=payload).json()

# Second request: ask for every organization at once by setting 'top' to the total
totalCount = data['@odata.count']
payload = {
    'orderBy[0]': 'UpperName asc',
    'top': str(totalCount),
    'filter': '',
    'query': '',
    'skip': '0'}

data = requests.get(url, headers=headers, params=payload).json()

# Map each organization name to its id and WebsiteKey
organizations = {}
for each in data['value']:
    organizations[each['Name']] = {'id': each['Id'], 'WebsiteKey': each['WebsiteKey']}

# Query the per-organization endpoint and collect each email
emails = {}
for name, each in organizations.items():
    websiteKey = each['WebsiteKey']

    url = 'https://blueprint.uchicago.edu/api/discovery/organization/bykey/%s' % websiteKey
    data = requests.get(url, headers=headers).json()
    emails[name] = data['email']
    print('%-70s: %s' % (name, data['email']))

# Save the results as a two-column CSV file
df = pd.DataFrame(list(emails.items()), columns=['Organization', 'Email'])
df.to_csv('file.csv', index=False)

Output:

{'A Cappella Council': 'uchicagoacappella@gmail.com', 'ACLU University of Chicago Law Chapter': 'dhbabrams@uchicago.edu', 'Active Minds at the University of Chicago': 'activemindsuchicago@gmail.com', 'African and Caribbean Student Association': 'cvleito@uchicago.edu', 'Aikido Kokikai': 'nahmadc@uchicago.edu', 'Alpha Kappa Psi': 'edwardchang@uchicago.edu', 'Alpha Phi Omega': 'uchi.apo.president@gmail.com', 'American Civil Liberties Union at University of Chicago': 'acluboard@lists.uchicago.edu', 'American Constitution Society': 'acs@law.uchicago.edu', 'American Medical Student Association': None, 'American Red Cross of University of Chicago': 'rkhouri@uchicago.edu', 'Amnesty International': 'eckere@uchicago.edu', 'Animal Legal Defense Fund - The University of Chicago Law School': 'ntschepik@uchicago.edu', 'Animal Welfare Society': 'petrucci@uchicago.edu', 'Anthropology Students Association': 'frevelolarotta@uchicago.edu', 'Apsara': 'uchicagoapsara@gmail.com', 'Arab Student Association': 'malakarafa@uchicago.edu', ...}
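
The output above includes at least one organization whose email is None. If only usable addresses should end up in the file, one option is to filter the dictionary before saving it. A minimal sketch, using a small hand-copied excerpt of the output as stand-in data:

```python
# Excerpt of the emails dict shown above, used here as stand-in data.
emails = {
    'A Cappella Council': 'uchicagoacappella@gmail.com',
    'American Medical Student Association': None,
    'Amnesty International': 'eckere@uchicago.edu',
}

# Keep only the organizations that actually list an email address.
with_email = {name: email for name, email in emails.items() if email}

# Remember which organizations had nothing listed, in case that matters later.
missing = [name for name, email in emails.items() if not email]

print(with_email)
print('No email listed:', missing)
```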

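First, think about how you would do this manually in a web browser. Then think about ways to automate that process. Don't expect a library to solve the whole problem automatically, and don't ask StackOverflow to write the code for you. Once you've come up with a solution, if you run into problems, this is the right place to ask for help getting your code to work.

Wow, how did you find the api url? I'm trying to do a similar thing for the UT Austin organizations: and output to a csv.

Open Dev Tools (Ctrl-Shift-I) and look at the Network tab --> XHR. Writing the csv seems pretty straightforward with Python's built-in csv library (import csv). Alternatively, just use pandas to write to csv. I'll update the code.

Wow, incredible. Thank you so much. I looked at Network -> XHR for the UT Austin orgs and it looks like there are five different api URLs. I'm guessing it's the second one? ()

Use the "Preview" tab to see what each one returns. Then you can narrow it down to the one you want.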
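
For the built-in csv route mentioned in the comments, a minimal sketch might look like this. The function name `write_emails_csv` is hypothetical, the sample dictionary is a short excerpt of the output above, and an in-memory buffer stands in for a real file so the result can be printed.

```python
import csv
import io

# Excerpt of the emails dict from the answer, used as stand-in data.
emails = {
    'A Cappella Council': 'uchicagoacappella@gmail.com',
    'American Medical Student Association': None,
}

def write_emails_csv(emails, fileobj):
    """Write one 'Organization,Email' row per entry; a missing email becomes an empty cell."""
    writer = csv.writer(fileobj)
    writer.writerow(['Organization', 'Email'])
    for name, email in emails.items():
        writer.writerow([name, email or ''])

# An in-memory buffer stands in for a file opened on disk.
buffer = io.StringIO()
write_emails_csv(emails, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open('file.csv', 'w', newline='')`; the `newline=''` argument is what the csv module's documentation recommends to avoid extra blank lines on some platforms.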