Scraping HTML emails with Python and BeautifulSoup


I am very new to BeautifulSoup and Python, so my question may seem simple, or as if I have misunderstood something, so please bear with me.

I have an O365 mailbox that I need to sweep at a set interval, extracting each message body into a JSON string that is then passed via an API to a web monitor. I have this working for some other sources, but for one particular email I cannot extract the details I need.

The raw email causing the problem looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><strong>'ExternalCon'</strong>  detected on <strong>1</strong> externally facing devices.

<br><br><br>

<a href="https:///">https://</a>
<br><br>
<br>
------------------------------ Query Results (up to 500 records) ------------------------------<br>
<br>
&quot;ip&quot;,&quot;uuid&quot;,&quot;repositoryID&quot;,&quot;score&quot;,&quot;total&quot;,&quot;severityInfo&quot;,&quot;severityLow&quot;,&quot;severityMedium&quot;,&quot;severityHigh&quot;,&quot;severityCritical&quot;,&quot;macAddress&quot;,&quot;policyName&quot;,&quot;pluginSet&quot;,&quot;netbiosName&quot;,&quot;dnsName&quot;,&quot;osCPE&quot;,&quot;biosGUID&quot;,&quot;tpmID&quot;,&quot;mcafeeGUID&quot;,&quot;lastAuthRun&quot;,&quot;lastUnauthRun&quot;,&quot;hostUniqueness&quot;,&quot;vulnBar&quot;,&quot;repositoryIDs&quot;<br>

&quot;***.***.***.***&quot;,&quot;&quot;,&quot;External IPs&quot;,&quot;0&quot;,&quot;1&quot;,&quot;1&quot;,&quot;0&quot;,&quot;0&quot;,&quot;0&quot;,&quot;0&quot;,&quot;&quot;,&quot;8d6b2cf3-1218-18685270/External_Con (from IO)&quot;,&quot;101022242455&quot;,&quot;&quot;,&quot;&quot;,&quot;cpe:/o:linux:linux_kernel:X.X&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;2605561676&quot;,&quot;repositoryID,ip,dnsName&quot;,&quot;1:0:0:0:0&quot;,&quot;4&quot;
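(For context on those rows: they are CSV that has been HTML-escaped, so every literal `"` has become `&quot;`. A minimal, stdlib-only sketch, using hypothetical sample data trimmed from the rows above, shows that once the entities are unescaped the rows parse cleanly with `csv.reader`, and each field can then be addressed by header name instead of by list position:)

```python
import csv
import html
import io

# Hypothetical sample, trimmed from the escaped rows above.
escaped = ('&quot;ip&quot;,&quot;uuid&quot;,&quot;repositoryID&quot;,&quot;score&quot;\n'
           '&quot;10.0.0.1&quot;,&quot;&quot;,&quot;External IPs&quot;,&quot;0&quot;')

decoded = html.unescape(escaped)              # -> '"ip","uuid","repositoryID","score" ...'
rows = list(csv.reader(io.StringIO(decoded)))
header, records = rows[0], rows[1:]

for rec in records:
    fields = dict(zip(header, rec))           # address fields by name, not index
    print(fields['ip'], '->', fields['repositoryID'])
```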
My code so far is below. I know it's ugly and there will be better ways to do this, so any comments would be welcome:

import calendar
import re

from bs4 import BeautifulSoup

def extcon():
    # logfile, currentDT and item (the message being processed) are set up elsewhere
    logfile.write(currentDT.strftime("%H:%M:%S.%f") + " - STARTING IMPORT LOOP FOR ExtCon - looking for folder . . . ")
    timed = calendar.timegm(item.datetime_received.timetuple())
    ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', item.body)
    r = item.body
    soup = BeautifulSoup(r, 'lxml')
    test = soup
    links = [a['href'] for a in soup.find_all('a', href=True)]
    descr = soup.text
    test1 = item.body.split('strong')
    test2 = soup.prettify()  #for string in test.stripped_strings:
    test = item.body.split('&quot;')
    head = '# ' + test1[1] + "\n\r" + "\n\r" \
        + '-------------------------------------' + "\n\r" + "\n\r" \
        + test1[1] + test1[2] + test1[3] + "\n\r" + "\n\r"
    head = head.replace('<', '')
    head = head.replace('>', '')
    head = head.replace('\'', '')
    head = head.replace('/', '')
    body = test[1] + ' -> ' + test[2] + "\n\r" + "\n\r" \
        + test[3] + ' -> ' + test[26] + "\n\r" + "\n\r" \
        + test[4] + ' -> ' + test[28] + "\n\r" + "\n\r" \
        + test[11] + ' -> ' + test[35] + "\n\r" + "\n\r" \
        + test[15] + ' -> ' + test[39] + "\n\r" + "\n\r"
What I get is that `test` splits the mail and gives me a list of 97 items, but when I try to call them individually to add them to the `body` variable, I get a lot more than I expected. I am also not sure how to handle the case where the tool detects more than one ExtCon. What am I doing wrong?

Thanks in advance.
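A more defensive approach could look like the sketch below. Assumptions: the body keeps the exact format shown above, bs4's bundled `html.parser` backend is an acceptable stand-in for lxml, and `parse_extcon` plus the sample data are made up for illustration. It takes the alert name and device count from the `<strong>` tags and loops over every CSV record, so multiple hits are handled:

```python
import csv
import io

from bs4 import BeautifulSoup

def parse_extcon(body_html):
    """Hypothetical helper: parse one ExtCon alert body in the format shown above."""
    soup = BeautifulSoup(body_html, 'html.parser')

    # The alert name and device count are the first two <strong> elements.
    alert, count = [s.get_text() for s in soup.find_all('strong')[:2]]

    # get_text() decodes &quot; back to ", so the query rows come out as plain CSV.
    text = soup.get_text(separator='\n')
    csv_lines = [ln for ln in text.splitlines() if ln.startswith('"')]
    rows = list(csv.reader(io.StringIO('\n'.join(csv_lines))))
    header, records = rows[0], rows[1:]

    lines = []
    for rec in records:                       # one entry per hit, however many there are
        fields = dict(zip(header, rec))
        lines.append(fields['ip'] + ' -> ' + fields['repositoryID'])
    return alert.strip("'"), count, lines

# Trimmed stand-in for item.body (hypothetical sample data).
sample = ('<strong>\'ExternalCon\'</strong> detected on <strong>1</strong> '
          'externally facing devices.<br><br>'
          '------ Query Results ------<br>'
          '&quot;ip&quot;,&quot;repositoryID&quot;,&quot;score&quot;<br>'
          '&quot;10.0.0.1&quot;,&quot;External IPs&quot;,&quot;0&quot;<br>')

alert, count, lines = parse_extcon(sample)
print(alert, count, lines)
```

Parsing the rows by header name avoids the magic indices (`test[26]`, `test[35]`, ...) that break as soon as a field contains an extra quote or the tool reports more than one record.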
