Parsing a web page with Python

python html parsing web-scraping

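I'm completely new to Python and need some help.

I'm trying to parse a web page and retrieve the email addresses from it. I've read a lot online, but everything I tried has failed.

I've noticed that running BeautifulSoup(browser.page_source) brings back the page source, but for some reason it does not include the email addresses or the business profiles.

Here is my code, please don't judge: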

import os, random, sys, time
from urllib.parse import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
import lxml

browser = webdriver.Chrome('./chromedriver.exe')

url = ('https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1')
browser.get(url)

BeautifulSoup(browser.page_source)

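Side note: my goal is to navigate the site based on search criteria and parse the email addresses on every page. I've worked out how to navigate the pages and send keys; it's only the parsing I'm stuck on. Any help would be much appreciated.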
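For the paging part of the side note, here is a minimal sketch of building the URL for each results page. It assumes the `pg` query parameter seen in the question's URL simply counts up by one per page, which the question does not confirm; `build_search_url` is a hypothetical helper name, and the three pages in the loop are an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

# Base address taken from the URL in the question.
BASE_URL = 'https://www.yellowpages.co.za/search'


def build_search_url(what, where, page):
    """Build the search URL for one results page.

    Assumption: 'pg' is a 1-based page counter, as the 'pg=1' in the
    question's URL suggests.
    """
    query = urlencode({'what': what, 'where': where, 'pg': page})
    return f'{BASE_URL}?{query}'


# Print the URLs for the first three pages (the page count is arbitrary).
for page in range(1, 4):
    print(build_search_url('accountant', 'cape town', page))
```

`urlencode` takes care of escaping, so a space in the search term becomes `+`, matching the form of the URL in the question.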

I suggest using the requests module to get the page source:

from requests import get

url = 'https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1'
src = get(url).text  # Gets the Page Source

After that, I searched for strings that look like email addresses and added them to a list:

# Keep only the part after the opening <body> tag. partition() also copes
# with a <body ...> tag that has attributes, where split('<body>')[1]
# would raise an IndexError.
head, sep, body = src.partition('<body')
src = body if sep else src

emails = []
delimiters = '<>":'  # Characters that mark the edge of an email

for ind, char in enumerate(src):
    if char == '@':
        # Scan right from the '@' until a delimiter or the end of the source
        end = ind + 1
        while end < len(src) and src[end] not in delimiters:
            end += 1

        # Scan left from the '@' until a delimiter or the start of the source
        start = ind
        while start > 0 and src[start - 1] not in delimiters:
            start -= 1

        domain = src[ind + 1:end]
        if '.' not in domain or domain.endswith('.'):  # This means that the email is
            continue                                   # not fully in the page

        emails.append(src[start:end])

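emails = set(emails)  # Remove Duplicates

print(*emails, sep='\n')

Does this answer your question?

Thank you, Rafael. I tried it and the same thing happened. When I print the source, it seems to leave out the entire first part, which contains all the email addresses, and only prints the last part. Any suggestions?

What do you mean by "the last part"?

Basically, the source of the actual web page has 3060 lines in total. When we parse the source with Python, it only picks up the source from line 1760 to line 3060.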
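The character-by-character scan in the answer could also be written as a regular expression. The sketch below is an alternative, not the answerer's method: `EMAIL_RE` and `extract_emails` are hypothetical names, the sample string is made up, and the pattern is a deliberately simple approximation of an email address rather than a full validator. It can only find addresses that are actually present in the text it is given, so it does not by itself explain why the first part of the source is missing.

```python
import re

# Simplified pattern (an assumption, not a complete email validator):
# a local part, an '@', then a domain with at least one dot-separated label.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return the unique email-like strings found in text, sorted."""
    return sorted(set(EMAIL_RE.findall(text)))


# Made-up sample standing in for the page source.
sample = (
    '<a href="mailto:info@example.co.za">info@example.co.za</a> '
    '<p>sales@example.com</p>'
)
print(*extract_emails(sample), sep='\n')
```

In practice you would pass the `src` string fetched earlier to `extract_emails` instead of the sample.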