For loop to scrape emails from a list of URLs saved in a CSV - BeautifulSoup
I am trying to parse email addresses from a list of URLs saved in CSV format. However, the code below only gets email addresses from a single website. I need advice on how to modify the code so that it loops through the list and saves the results (the list of emails) to a CSV file.
import requests
import re
import csv
from bs4 import BeautifulSoup

allLinks = []
mails = []

with open(r'url.csv', newline='') as csvfile:
    urls = csv.reader(csvfile, delimiter=' ', quotechar='|')
    links = []
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = [a.attrs.get('href') for a in soup.select('a[href]')]
        allLinks = set(links)

def findMails(soup):
    for name in soup.find_all('a'):
        if name is not None:
            emailText = name.text
            match = bool(re.match('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', emailText))
            if '@' in emailText and match == True:
                emailText = emailText.replace(" ", '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if (len(mails) == 0) or (emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if link.startswith("http") or link.startswith("www"):
        r = requests.get(link)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)
    else:
        newurl = url + link
        r = requests.get(newurl)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
You are overwriting links when you mean to add to it. Accumulate the links from each URL instead:
allLinks = []
mails = []

urls = ['https://www.nus.edu.sg/', 'http://gwiconsulting.com/']

links = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('a[href]')]

allLinks = set(links)
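If you want to pull those URLs out of url.csv instead of hard-coding them, a minimal sketch (assuming one URL per row, in the first column) might look like this:

import csv

urls = []
with open('url.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        if row:                           # skip empty rows
            urls.append(row[0].strip())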
At the end, loop over your mails and write them to a CSV:
import csv

with open("emails.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Email'])
    for mail in mails:
        w.writerow([mail])   # wrap in a list: writerow expects a sequence of fields
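Note that without the list wrapper, w.writerow(mail) would treat the string as a sequence of characters and write one column per character.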
I have modified the script as you suggested, but it only runs for the second url (), not the first; the nus link produces "www.nus.edu.sg - SSL connection failed / SSL not supported". Check whether you can actually connect to that site at all, and call raise_for_status() on the response. Doing that we get "HTTPError: 503 Server Error: Service Unavailable for url:". I have a list of 70 URLs, some of which are invalid. Is there a way to ignore the invalid URLs, keep looping over the valid ones, and return the results?
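One way to do that (just a sketch, assuming urls is the list built above) is to wrap each request in try/except so that SSL failures, connection errors and HTTP errors like the 503 are skipped and the loop moves on to the next URL; the timeout value here is an arbitrary choice:

from bs4 import BeautifulSoup
import requests

links = []
for url in urls:
    try:
        response = requests.get(url, timeout=10)   # timeout chosen arbitrarily
        response.raise_for_status()                # raises HTTPError on 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        # covers SSL errors, connection failures, timeouts and bad status codes
        print("Skipping", url, "-", e)
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('a[href]')]

allLinks = set(links)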