Python: How can I use BeautifulSoup to iterate over multiple internal links on a site and output all the e-mail addresses?

python loops web-scraping

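I am trying to retrieve all the e-mail addresses from an internal alphabetical index.

Basically, I am looking for a way to use BeautifulSoup to first go through all the different alphabet links, and then go through each company page to print all the corresponding e-mail addresses.

I have been able to print a list of all the companies on the site, but I am not sure how to iterate over the further levels of links. I considered using a dictionary and creating a key for each letter separately, but I can't seem to get that to work.

Here is the code so far, which successfully extracts all the company names, along with a regex for extracting the e-mail address from one company page at a time. What is the best way to print all the e-mail addresses at once?

Any input is welcome.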

from bs4 import BeautifulSoup
import requests

alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
#alphabet = ['a']

resultsdict = {}
companyname = []
# each index page is url1 + letter + url2, e.g. .../Company/a.aspx
url1 = 'http://www.indiainfoline.com/Markets/Company/'
url2 = '.aspx'
for element in alphabet:
    html = requests.get(url1 + element + url2).text
    # name the parser explicitly so the result does not depend on what is installed
    bs = BeautifulSoup(html, 'html.parser')
    # find the links to companies
    company_menu = bs.find("div", {'style': 'padding-left:5px'})
    # print all companies links
    companies = company_menu.find_all('a')
    for company in companies:
        print(company.getText().strip())

import re
# example company page
html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-Ltd/533096').text
# raw string so the backslashes reach the regex engine unchanged
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
print(EMAIL_REGEX.findall(html))
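
The regex can be checked without touching the network by running it against a small hand-written snippet. The sketch below reuses the pattern from the question; the sample markup, the example.com/example.org addresses, and the helper name `unique_emails` are made up for illustration.

```python
import re

# Same pattern as in the question, written as a raw string.
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")

# Hypothetical markup standing in for a company page.
sample_html = '''
<a href="mailto:investors@example.com">Investor relations</a>
<a href="/Markets/Company/a.aspx">A</a>
<a href="mailto:info@example.org">Contact</a>
<a href="mailto:investors@example.com">Duplicate</a>
'''

def unique_emails(html):
    """Return the addresses found in mailto: links, in order, without repeats."""
    seen = []
    for address in EMAIL_REGEX.findall(html):
        if address not in seen:
            seen.append(address)
    return seen

print(unique_emails(sample_html))
```

One thing to watch: the pattern allows only a single dot after the @ and a two-to-four-letter ending, so an address on a subdomain such as info@mail.example.com would be cut short. If the site uses addresses like that, the domain part of the pattern would need loosening.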

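A suggestion from someone who has done a lot of web scraping: loop over the company links, open each page, and grab the e-mail address (or whatever data you want) that it finds. I only saw one e-mail link on the page, so the first one it finds will do. A rough example: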

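for company in companies:
    # assumes company['href'] is an absolute URL that requests can fetch
    company_html = requests.get(company['href']).text  # .text: parse the body, not the Response object
    company_bs = BeautifulSoup(company_html, 'html.parser')
    company_page_links = company_bs('a')  # shorthand for company_bs.find_all('a')
    for link in company_page_links:
        # .get avoids a KeyError on <a> tags that have no href attribute
        if link.get('href', '').startswith('mailto:'):
            # You found the e-mail address!
            break  # Exits the loop, as you already found the address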
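
The loop stops at the mailto link but does not yet pull the address out of it. A minimal sketch of that last step, using only the standard library: the helper name `email_from_mailto` is hypothetical, and it assumes the href is an ordinary `mailto:` URL, possibly with a `?subject=...` suffix.

```python
from urllib.parse import unquote, urlsplit

def email_from_mailto(href):
    """Return the address part of a mailto: link, or None for any other link."""
    parts = urlsplit(href.strip())
    if parts.scheme != "mailto":  # urlsplit lowercases the scheme
        return None
    # For mailto: URLs everything before '?' ends up in .path
    address = unquote(parts.path).strip()
    return address or None
```

Inside the loop this would be called as `email_from_mailto(link['href'])` just before the `break`, and the result could be printed or stored in a dictionary keyed by company name.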

Thanks for the reply! It's a good idea, but I can't seem to get it to work. I'm not very familiar with loops, so my syntax may be wrong, but I keep getting a MissingSchema error.
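
requests raises MissingSchema when the URL it is given has no `http://` or `https://` prefix, so a likely cause (an assumption, since the page markup is not shown here) is that `company['href']` is a site-relative path such as `/Markets/Company/...`. Resolving each href against the index page URL with `urljoin` before fetching should avoid that. Below is a rough stdlib-only sketch of the two-level loop. It swaps BeautifulSoup for `html.parser` and takes the download step as a `fetch` argument so it can run without the network; `LinkCollector`, `collect_emails`, `fetch`, and the looser `MAILTO_RE` pattern are all hypothetical names and choices, not part of the original code.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Looser than the pattern in the question: allows dots in the domain (an assumption).
MAILTO_RE = re.compile(r"mailto:([A-Za-z0-9._+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,})")

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def collect_emails(index_url, fetch):
    """Follow every link on one index page and map page URL -> e-mail addresses.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    """
    collector = LinkCollector()
    collector.feed(fetch(index_url))
    emails = {}
    for href in collector.hrefs:
        if href.startswith("mailto:"):
            continue  # not a page to visit
        # Turn '/Markets/Company/...' into a full URL so no scheme is missing.
        company_url = urljoin(index_url, href)
        found = MAILTO_RE.findall(fetch(company_url))
        if found:
            emails[company_url] = sorted(set(found))
    return emails

# With requests installed, the real download step could be passed in like this:
# emails = collect_emails(url1 + 'a' + url2, lambda u: requests.get(u).text)
```

Calling `collect_emails` once per letter and merging the returned dictionaries would cover the whole index. In the BeautifulSoup version from the question, only the `urljoin(...)` line needs to be added: `requests.get(urljoin(url1, company['href']))`.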