Python: returning body text with BeautifulSoup

python email web-scraping

I am trying to use BeautifulSoup to strip the HTML tags from content returned by ExchangeLib. This is what I have so far:

from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup

credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)

for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)

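As written, this uses exchangelib to fetch the most recent email from my inbox via Exchange and prints its unique_body, which holds the body of the message. The output of print(soup) is still the full HTML markup of that body, tags included.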

From what I have read in the BeautifulSoup documentation, the scraping step belongs between my soup = line and the final print line.

My problem is that the scraping part of BeautifulSoup appears to need a class and an h1 tag, for example name_box = soup.find('h1', attrs={'class': 'name'}), and the markup I have contains neither.

As someone new to Python, how should I go about this?

You need to print the contents of the font tags. You can use the select method and pass it a type selector for the font element.

You can try finding all the font tags and then iterating over them.

from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for font in soup.find_all('font'):
    print(font.text)

Tried it; I had to change bs(html, 'lxml') to BeautifulSoup(html, 'lxml'). My new output is: ['Hey John,', '\xa0', 'Here is a test email']. Much better, but it still needs some minor fixes.

I aliased BeautifulSoup as bs in the import statement, which is why it differs. I did not get the '\xa0' value, so I guess your input may differ from what you showed? What are the remaining fixes? I currently seem to get the expected result. The select method is also fast.

You can always remove that character with .replace('\xa0', ''), but I still do not know how you got it, because mine prints perfectly. Could something be different in your language settings?

This worked! The output is exactly what you said. Thank you very much.

from bs4 import BeautifulSoup as bs

html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''

soup = bs(html, 'lxml')

textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
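One of the comments reports a stray '\xa0' in the output of this approach. A likely cause is that the empty line in the real email is a non-breaking space, which does not equal an ordinary ' ' and so passes the filter. The following is a minimal sketch of one way to handle that, assuming a shortened sample in which the blank line is written as &nbsp;; the name cleaned is only illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample: the blank line is written as &nbsp;, which the parser
# decodes to the non-breaking space character '\xa0'.
html = """<div><font>Hey John,</font></div>
<div><font>&nbsp;</font></div>
<div><font>Here is a test email</font></div>"""

soup = BeautifulSoup(html, "html.parser")

cleaned = []
for font in soup.select("font"):
    # Turn non-breaking spaces into ordinary spaces, then trim both ends.
    text = font.get_text().replace("\xa0", " ").strip()
    if text:  # skip entries that contained only whitespace
        cleaned.append(text)

print(cleaned)
```

Replacing with an ordinary space rather than an empty string keeps words apart when a non-breaking space sits in the middle of a sentence.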

The find_all loop shown earlier prints:

Hey John,

Here is a test email
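If the goal is simply the body text with every tag removed, it may not be necessary to target font tags at all. As a rough sketch, get_text can be called on the whole document; html_to_text and sample are hypothetical names introduced here, and the &nbsp; in the sample is an assumption about what the blank line in the real message contains.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string, one line per text node."""
    soup = BeautifulSoup(html, "html.parser")
    # separator joins the text nodes; strip=True trims each node and
    # drops nodes that are empty after trimming.
    return soup.get_text(separator="\n", strip=True)


sample = """<html><body><div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">&nbsp;</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</div></body></html>"""

print(html_to_text(sample))
```

In the question's loop the same helper could presumably be given item.unique_body in place of sample, although that has not been tried against a live Exchange account here.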