Scraping an HTML email with Python and BeautifulSoup
I am very new to BeautifulSoup and Python, so my question may seem simple, or it may look as though I have misunderstood something. Please bear with me.
I have an O365 mailbox that I need to scrape at a set interval. The message body is then extracted into a JSON string and passed to a web monitor through an API. I have this working for several other sources, but for one email I cannot get the details I need.
The original email causing the problem looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><strong>'ExternalCon'</strong> detected on <strong>1</strong> externally facing devices.
<br><br><br>
<a href="https:///">https://</a>
<br><br>
<br>
------------------------------ Query Results (up to 500 records) ------------------------------<br>
<br>
"ip","uuid","repositoryID","score","total","severityInfo","severityLow","severityMedium","severityHigh","severityCritical","macAddress","policyName","pluginSet","netbiosName","dnsName","osCPE","biosGUID","tpmID","mcafeeGUID","lastAuthRun","lastUnauthRun","hostUniqueness","vulnBar","repositoryIDs"<br>
"***.***.***.***","","External IPs","0","1","1","0","0","0","0","","8d6b2cf3-1218-18685270/External_Con (from IO)","101022242455","","","cpe:/o:linux:linux_kernel:X.X","","","","","2605561676","repositoryID,ip,dnsName","1:0:0:0:0","4"
My code so far is below. I know it is ugly and that there will be better ways to do this, so any advice is welcome.
def extcon():
    logfile.write(currentDT.strftime("%H:%M:%S.%f") + " - STARTING IMPORT LOOP FOR ExtCon- lookng for folder . . . ")
    timed = calendar.timegm(item.datetime_received.timetuple())
    ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', item.body)
    r = item.body
    soup = BeautifulSoup(r, 'lxml')
    test = soup
    links = [a['href'] for a in soup.find_all('a', href=True)]
    descr = soup.text
    test1 = item.body.split('strong')
    test2 = soup.prettify #for string in test.stripped_strings:
    test = item.body.split('"')
    head = '# ' + test1[1] + "\n\r" + "\n\r" \
        + '-------------------------------------' + "\n\r" + "\n\r" \
        + test1[1] + test1[2] + test1[3] + "\n\r" + "\n\r"
    head = head.replace('<', '')
    head = head.replace('>', '')
    head = head.replace('\'', '')
    head = head.replace('/', '')
    body = test[1] + ' -> ' + test[2] + "\n\r" + "\n\r" \
        + test[3] + ' -> ' + test[26] + "\n\r" + "\n\r" \
        + test[4] + ' -> ' + test[28] + "\n\r" + "\n\r" \
        + test[11] + ' -> ' + test[35] + "\n\r" + "\n\r" \
        + test[15] + ' -> ' + test[39] + "\n\r" + "\n\r"
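What happens is that test splits the message and gives me a list of 97 items, but when I try to pick items out individually to add them to the body variable, I get far more than I expected. I am also not sure how to handle the case where the tool finds more than one ExtCon. What am I doing wrong?
Thanks in advance.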
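One possible direction, offered as a sketch rather than a definitive fix: splitting the body on '"' also splits on the quotes inside the HTML attributes and returns the commas between fields as separate items, which makes fixed indexes such as test[26] fragile. The query results in the sample are CSV, so Python's csv module can read the header and every data row, giving one dict per detected device. The function names parse_query_results and format_record, the sample string, and the choice of fields are assumptions made for illustration; the code also assumes every CSV line starts with a double quote and is separated by <br> or a newline, as in the email above.

```python
import csv
import html
import re


def parse_query_results(body):
    """Return one dict per CSV data row found in the HTML email body."""
    csv_lines = []
    # Break the body into lines at each <br> tag and at real newlines.
    for line in re.split(r'<br\s*/?>|\r?\n', body, flags=re.IGNORECASE):
        line = html.unescape(line).strip()
        # Assumption: only the CSV header and data rows start with a quote.
        if line.startswith('"'):
            csv_lines.append(line)
    # DictReader uses the first line as the header and handles empty
    # fields and commas inside quoted fields.
    return list(csv.DictReader(csv_lines))


def format_record(record, fields=("ip", "policyName", "osCPE")):
    """Build the 'name -> value' text for one record (fields are illustrative)."""
    return "\n\n".join(f"{name} -> {record[name]}" for name in fields)


# Hypothetical, shortened stand-in for item.body with two data rows.
sample_body = (
    "<strong>'ExternalCon'</strong> detected on <strong>2</strong> devices.<br>\n"
    '"ip","policyName","osCPE","hostUniqueness"<br>\n'
    '"192.0.2.10","External_Con","cpe:/o:linux:linux_kernel","repositoryID,ip,dnsName"<br>\n'
    '"192.0.2.11","External_Con","","ip"'
)

for record in parse_query_results(sample_body):
    print(format_record(record))
    print("-" * 20)
```

Because each data row becomes its own dict, an email reporting several devices needs no extra handling beyond the loop. Fields are looked up by column name instead of list position, so the result does not shift if the number of quotes in the surrounding HTML changes.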