Extracting emails from a web page with Python (Python, BeautifulSoup, Python Requests)


I found the code below, which scrapes email addresses from a website (the whole site, I think):

import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup

# starting url. replace it with your own url.
starting_url = 'http://www.miet.ac.in'

# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])

# set of already crawled urls for email
processed_urls = set()

# a set of fetched emails
emails = set()

# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):

    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)
    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, 'lxml')

    # Once this document is parsed and processed, now find and process all the anchors i.e. linked urls in this document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not in unprocessed list nor in processed list yet
        if link not in unprocessed_urls and link not in processed_urls:
            unprocessed_urls.append(link)

How can I modify this code so that it scrapes only a single web page? I only need to target one page, not the entire website.

Just delete all the lines starting from

    for anchor in soup.find_all("a"):

onward. Your script should then look like this:

import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup

# starting url. replace it with your own url.
starting_url = 'http://www.miet.ac.in'

# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])

# set of already crawled urls for email
processed_urls = set()

# a set of fetched emails
emails = set()

# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):

    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)
    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, 'lxml')
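
Since only one page is needed, the queue and the set of processed URLs are not strictly required. The following is a minimal sketch of the same idea with a single request. It reuses the regular expression from the code above; the function names extract_emails and emails_from_page and the timeout value are illustrative assumptions, not part of the original script.

```python
import re

# Same pattern as in the script above; edit it to suit your needs.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)


def extract_emails(text):
    """Return the set of email-like strings found in text."""
    return set(EMAIL_RE.findall(text))


def emails_from_page(url, timeout=10):
    """Fetch one page and return the addresses found in its raw HTML.

    The name and the timeout default are assumptions made for this sketch.
    """
    # Imported here so extract_emails() works even without requests installed.
    import requests

    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers missing schema, connection errors, timeouts and HTTP errors.
        print("Could not fetch %s: %s" % (url, exc))
        return set()
    return extract_emails(response.text)
```

Calling emails_from_page('http://www.miet.ac.in') would then return the set of addresses found on that one page. BeautifulSoup is not needed here, because the regular expression runs directly on the response text.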

To generate random email addresses with Python, use the following:

from faker import Faker

faker = Faker()

for i in range(12):
    print(f'{faker.email()}')

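Your question is unclear or not limited to a specific problem. Please review it and edit it to state the exact problem.

The problem is getting emails from a website (not the whole site, just one link).

Removing the loop seems simple enough, to be honest. It would be even easier without the queue: just pull out the bs4 bits and make a single request.

@QHarr I'm very familiar with Python stuff :)

@QHarr Can you help me? What if the site uses JavaScript and the code doesn't handle it? Is there a workaround?

Thanks a lot. That's great. But when I try it on this site, https://www.randomlists.com/email-addresses, it doesn't work.

That's because this site loads the email addresses with JavaScript after the page itself has finished loading. When the page is crawled, the addresses are not in the HTML yet, so the crawler cannot find them. You can use my other answer to generate random email addresses with Python.

Thanks a lot. Any luck handling JavaScript? The goal is to get emails from any page, with or without JavaScript.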
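
One possible way to handle pages that fill in their content with JavaScript is to let a real browser render the page first and then run the same regular expression on the rendered HTML. The sketch below assumes Selenium 4 and a local Chrome installation, neither of which is mentioned in the thread. The function names and the wait time are hypothetical, and the "--headless=new" flag assumes a recent Chrome version.

```python
import re

# Same pattern as in the question's script.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)


def extract_emails(text):
    """Return the set of email-like strings found in text."""
    return set(EMAIL_RE.findall(text))


def emails_from_rendered_page(url, wait_seconds=10):
    """Render url in headless Chrome, then extract addresses.

    Assumes Selenium 4 and Chrome are installed; this is a sketch,
    not the method used by anyone in the thread.
    """
    # Imported lazily so extract_emails() works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        try:
            # Wait until at least one address shows up in the rendered HTML.
            WebDriverWait(driver, wait_seconds).until(
                lambda d: EMAIL_RE.search(d.page_source)
            )
        except TimeoutException:
            # Nothing appeared in time; fall through and return what we have.
            pass
        return extract_emails(driver.page_source)
    finally:
        driver.quit()
```

Rendering in a browser is much slower than a plain requests.get call, so it may be worth trying the plain request first and falling back to the browser only when no addresses are found.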