Parsing email bodies with Python
I am working with the Enron dataset, and I want to extract the clean email bodies into a list, with each reply stored as a separate string in that list. For example, given the following email:
Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc:
X-bcc:
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf
So, what is it? And by the way, don't start with the excuses. You're
expected to be a full, gourmet cook.
Kisses, not music, makes cooking a more enjoyable experience.
"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:
Subject: Hi
I told you I have a long email address.
I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while. My only request is that your stereo works. Music
makes cooking a more enjoyable experience.
Watch the debate if you are home tonight. I want a report tomorrow...
Jen
___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com
I would like to end up with a list whose first element is:
"So what is it? And by the way don't start with the excuses. You're
expected to be a full gourmet cook. Kisses not music makes cooking a more enjoyable experience."
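Is there a library that can do this?
I tried Python's email library, but it does not seem to offer this functionality: it gives me the whole body back as a single string, whereas the result I am after is: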
["So what is it? And by the way don't start with the excuses. You're
expected to be a full gourmet cook. Kisses not music makes cooking a more enjoyable experience.",
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while. My only request is that your stereo works. Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight. I want a report tomorrow...
Jen"]
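This is what I tried with the email library:
import email
message = data_
e = email.message_from_string(message)
print(e.get_payload())
It returns the full body as one string:
'So, what is it? And by the way, don\'t start with the excuses. You\'re\nexpected to be a full, gourmet cook.\n\nKisses, not music, makes cooking a more enjoyable experience.\n\n\n\n\n"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\nTo: jarnold@enron.com\ncc:\nSubject: Hi\n\n\nI told you I have a long email address.\n\nI\'ve decided what to prepare for dinner tomorrow. I hope you aren\'t\nexpecting anything extravagant because my culinary skills haven\'t been\nput to use in a while. My only request is that your stereo works. Music\nmakes cooking a more enjoyable experience.\n\nWatch the debate if you are home tonight. I want a report tomorrow...\nJen\n\n___________________________________________________________________\nTo get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,\nall in one place - sign up today at http://www.zdnetonebox.com\n\n\n'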
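Sorry, but an email in this format cannot be decoded reliably, because there is no way to tell the header block of a quoted message apart from ordinary body text: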
"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:
Subject: Hi
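The body of a real email could contain that same string for some reason, so how would you tell which occurrence is genuine body text and which is a header? As I see it, you have a few options here. The simplest is to use
text = re.sub(r'\s+', ' ', text)
to collapse every run of whitespace, including \n, into a single space, and then presumably go on to fix every occurrence of \' and everything from the separator line down. This seems the simplest solution, but it has limitations. All of the ones in your example can be handled with a few regex/grep/awk tricks.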
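The clean-up described above could be sketched roughly as follows. This is only a minimal sketch: the helper name `flatten_body` is hypothetical, and the rule that the footer starts at a line of ten or more underscores is an assumption taken from the one sample message, not something the dataset guarantees.

```python
import re

# Assumption: the advertising footer starts at a line made of 10+ underscores,
# as in the sample email. Other messages may use different separators.
FOOTER_RE = re.compile(r'^_{10,}\s*$', re.MULTILINE)


def flatten_body(text):
    """Cut the text at the first separator line, then collapse all whitespace."""
    match = FOOTER_RE.search(text)
    if match:
        text = text[:match.start()]
    # Replace every run of whitespace (spaces, tabs, \n) with one space.
    return re.sub(r'\s+', ' ', text).strip()


sample = (
    "Watch the debate.\nI want a report tomorrow...\nJen\n\n"
    + "_" * 67
    + "\nTo get your own FREE ZDNet Onebox\n"
)
print(flatten_body(sample))
```

Note that this still leaves the quoted header lines (`To:`, `cc:`, `Subject:`) in the text, which is exactly the ambiguity described above.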
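I will assume you have all the Enron email messages in a single .csv file, which is a common format for this dataset. While working with this message I noticed a few data-cleaning issues, mostly around the "\n" characters in the message. I am still looking for a way to fix that small problem.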
import re as regex

def expunge_doublespaces(raw_string):
    # Recursively collapse runs of spaces until no double space remains.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))

def parse_raw_email_message(raw_message):
    lines = raw_message.splitlines()
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text.
            # They are joined without a separator; the regex below re-inserts some spaces.
            message += line
            email['body'] = message
        else:
            # Lines with a colon are treated as "key: value" headers.
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

###############################################
# change this open section to fit your dataset
###############################################
with open('enron_emails/sample_email.txt', 'r') as in_file:
    parsed_email = parse_raw_email_message(in_file.read())
    for key, value in parsed_email.items():
        if key == "body":
            # this regex adds a space after periods, commas and words that end in 't
            # when they are not already followed by whitespace
            first_cleaning = regex.sub(r"(?<=('t)(?=[^\s]))|(?<=[.,])(?=[^\s])", r' ', value)
            cleaned_body = expunge_doublespaces(first_cleaning)
            print(cleaned_body)
# print output
So, what is it? And by the way, don't start with the excuses. You're
expected to be a full, gourmet cook. Kisses, not music, makes cooking
a more enjoyable experience. I told you I have a long email address.
I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't
beenput to use in a while. My only request is that your stereo works.
Musicmakes cooking a more enjoyable experience. Watch the debate if
you are home tonight. I want a report tomorrow. . . Jen
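This actually follows the email specification: what separates the headers from the body is an empty line ending in \r\n (CRLF); check:
The problem is that the email body really is free-form, so something fuzzy has to be done. Inner emails can be quoted in different ways. This will always be a hard thing.
So while exact decoding is "impossible", a good algorithm that handles the most common quoting styles could get 99%+ of emails sorted correctly. If part of a message looks so much like a real quotation that it is interpreted as one, then it probably should count as one.
However, according to the question, the actual emails do not contain \r\n, only \n.
Did my answer solve your problem? If so, please accept it. If not, please follow up with specifics so that any outstanding issues can be resolved. Thanks.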
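To illustrate the header/body point from the comments: Python's standard email parser splits a message at the first blank line, and it does so for bare \n line endings as well. The message text below is a shortened, made-up stand-in for the sample email, so treat this as a small demonstration rather than a test against the real dataset.

```python
import email

# Shortened stand-in for the sample message, using bare \n line endings.
raw = (
    "From: john.arnold@enron.com\n"
    "To: jenwhite7@zdnetonebox.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "So, what is it?\n"
    "To: this line is body text, not a header\n"
)

msg = email.message_from_string(raw)
headers = dict(msg.items())   # only the lines before the first blank line
body = msg.get_payload()      # everything after it, as one string

print(headers["To"])
print(body)
```

The parser stops treating lines as headers at the blank line, so the second `To:` line ends up in the body. That is why the quoted header block inside a reply has to be handled separately.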
import re as regex
import email

def expunge_doublespaces(raw_string):
    # Recursively collapse runs of spaces until no double space remains.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))

with open('enron_emails/sample_email.txt', 'r') as in_file:
    email_body = ''
    raw_message = in_file.read()
    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)
    # iterate over all the parts and subparts of a message object tree
    for part in msg.walk():
        # Return the message's content type.
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
    # Replace the quoted header block (attribution line, To:, cc:, Subject:) with a space.
    first_cleaning = regex.sub(r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))", r' ',
                               email_body)
    clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
    print(clean_body)
# print output
So, what is it? And by the way, don't start with the excuses.
You're expected to be a full, gourmet cook. Kisses, not music,
makes cooking a more enjoyable experience. I told you I have a
long email address. I've decided what to prepare for dinner
tomorrow. I hope you aren't expecting anything extravagant
because my culinary skills haven't been put to use in a while.
My only request is that your stereo works. Music makes cooking a
more enjoyable experience. Watch the debate if you are home
tonight. I want a report tomorrow... Jen
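The question asked for a list with one string per message rather than one flattened string. One possible way to get there, sketched here under assumptions, is to split on the quoted header block instead of deleting it. The pattern is modelled only on the quoting style in the sample (an attribution line ending in `on MM/DD/YYYY HH:MM:SS AM/PM`, followed by `To:`, `cc:` and `Subject:` lines). `ATTRIBUTION_RE` and `split_replies` are hypothetical names, and other quoting styles in the dataset would need their own patterns.

```python
import re

# Assumption: quoted messages are introduced exactly as in the sample email.
ATTRIBUTION_RE = re.compile(
    r'^.*\bon \d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n'
    r'To:.*\n'
    r'cc:.*\n'
    r'Subject:.*$',
    re.MULTILINE,
)


def split_replies(body):
    """Split a body at each quoted header block and tidy the whitespace."""
    parts = ATTRIBUTION_RE.split(body)
    cleaned = (re.sub(r'\s+', ' ', part).strip() for part in parts)
    return [part for part in cleaned if part]


body = (
    "So, what is it?\nYou're expected to be a full, gourmet cook.\n\n\n"
    '"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n'
    "To: jarnold@enron.com\ncc: \nSubject: Hi\n\n\n"
    "I told you I have a long email address.\nJen\n"
)
for reply in split_replies(body):
    print(repr(reply))
```

Applied to the payload returned by `get_payload()`, this would give the newest message first and each quoted message after it. The ZDNet footer would still be attached to the last element unless it is stripped separately.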