Parsing email bodies with Python


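I am working with the Enron dataset, and I want to extract the clean email bodies into a list, with each reply stored as one string in the list.

For example, for the following email: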

Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it?   And by the way, don't start with the excuses.   You're 
expected to be a full, gourmet cook.

Kisses, not music, makes cooking a more enjoyable experience.  




"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi


I told you I have a long email address.

I've decided what to prepare for dinner tomorrow.  I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience.

Watch the debate if you are home tonight.  I want a report tomorrow...
Jen

___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com

the output would be:

["So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience.", 
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow.  I hope you aren't 
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight.  I want a report tomorrow...
Jen"]

where the first element of the list is:

"So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience."

Is there a library that can do this?

I have tried the Python email library, but it does not seem to offer this, because I get the whole body back:

import email
message = data_  # data_ holds the raw message text as a string
e = email.message_from_string(message)
print(e.get_payload())

'So, what is it?   And by the way, don\'t start with the excuses.   You\'re \nexpected to be a full, gourmet cook.\n\nKisses, not music, makes cooking a more enjoyable experience.  \n\n\n\n\n"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\nTo: jarnold@enron.com\ncc:  \nSubject: Hi\n\n\nI told you I have a long email address.\n\nI\'ve decided what to prepare for dinner tomorrow.  I hope you aren\'t\nexpecting anything extravagant because my culinary skills haven\'t been\nput to use in a while.  My only request is that your stereo works.  Music\nmakes cooking a more enjoyable experience.\n\nWatch the debate if you are home tonight.  I want a report tomorrow...\nJen\n\n___________________________________________________________________\nTo get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,\nall in one place - sign up today at http://www.zdnetonebox.com\n\n\n'


Sorry, but an email in this format cannot be decoded reliably, because nothing distinguishes the header of the quoted message:

"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi

The actual body of an email could contain those same strings for some reason, so how would you tell which part is real body and which is header?
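
To make the ambiguity concrete, here is a small sketch using the standard-library `email` module. The message is a shortened stand-in for the one in the question (an assumption made to keep the example short). The parser takes everything before the first blank line as headers; the quoted header further down is left untouched as ordinary body text.

```python
import email

# Shortened stand-in for the message in the question (assumed, not the full file).
raw = (
    "From: john.arnold@enron.com\n"
    "To: jenwhite7@zdnetonebox.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "So, what is it?\n"
    "\n"
    '"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n'
    "To: jarnold@enron.com\n"
    "Subject: Hi\n"
    "\n"
    "I told you I have a long email address.\n"
)

msg = email.message_from_string(raw)

# Real headers: everything before the first blank line.
print(msg["To"])       # jenwhite7@zdnetonebox.com
print(msg["Subject"])  # Re: Hi

# The quoted header is not parsed; it is still plain text inside the payload.
body = msg.get_payload()
print("To: jarnold@enron.com" in body)  # True
```

Anything that separates the reply from the quoted message therefore has to be a heuristic applied to `body`.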

As I see it, you have a few options here:

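  • Use your existing function, but add a second function that does some text processing. For example, text = re.sub(r'[\s]+', ' ', text) collapses every run of whitespace, including the \n occurrences, into a single space; from there you would presumably go on to fix every \' and drop everything from the separator line down. This seems the simplest solution but it has limitations, though all of them (going by your example) can be handled with some regex/grep/awk tricks.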

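  • Another library (as you asked). I know of one; you can probably guess what it does from its name, and it also parses emails. Again, some post-processing may be in order, but it looks like using a combination of the headers (e.g. date and body) should do most of what you need.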

  • A web service, for example
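
As a rough sketch of the first option, the post-processing step could look like the following. The function name `clean_body` and the rule "a line of ten or more underscores starts the footer" are assumptions based on the single example in the question, not something the dataset guarantees.

```python
import re


def clean_body(text):
    """Cut the footer at the separator line, then collapse whitespace."""
    # Assumption: a line made only of 10+ underscores starts the footer.
    text = re.split(r"^_{10,}[ \t]*$", text, maxsplit=1, flags=re.MULTILINE)[0]
    # Replace each run of whitespace (including \n) with one space.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "I want a report tomorrow...\n"
    "Jen\n"
    "\n"
    "__________________________________\n"
    "To get your own FREE ZDNet Onebox\n"
)
print(clean_body(sample))  # I want a report tomorrow... Jen
```

Replacing whitespace with a single space rather than an empty string is what keeps words on either side of a line break from running together.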


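I am going to assume that you have all the Enron email messages in a .csv file, which is a common format for this dataset. While processing this message I noticed some data-cleaning issues, mostly around the "\n" characters in the message. I am still working out a way to fix this minor issue.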

    import re as regex


    def expunge_doublespaces(raw_string):
        """Collapse runs of spaces into single spaces."""
        if '  ' not in raw_string:
            return raw_string
        return expunge_doublespaces(raw_string.replace('  ', ' '))


    def parse_raw_email_message(raw_message):
        """Collect the from/to headers and the colon-free lines as the body."""
        lines = raw_message.splitlines()
        email = {}
        message = ''
        keys_to_extract = ['from', 'to']
        for line in lines:
            if ':' not in line:
                # Lines are joined with no separator, which is why words that
                # straddle a line break run together in the output below.
                message += line
                email['body'] = message
            else:
                # Any line with a colon is treated as "key: value"; this also
                # skips the quoted header lines inside the body.
                key, _, val = line.partition(':')
                key = key.lower()
                val = val.strip()
                if key in keys_to_extract:
                    email[key] = val
        return email


    ###############################################
    # change this open section to fit your dataset
    ###############################################
    with open('enron_emails/sample_email.txt', 'r') as in_file:
        parsed_email = parse_raw_email_message(in_file.read())

    for key, value in parsed_email.items():
        if key == "body":
            # this regex adds whitespace after 't and after periods or commas
            # that are directly followed by a non-space character
            first_cleaning = regex.sub(r"(?<=('t)(?=[^\s]))|(?<=[.,])(?=[^\s])", r' ', value)
            cleaned_body = expunge_doublespaces(first_cleaning)
            print(cleaned_body)

    # print output
    # So, what is it? And by the way, don't start with the excuses. You're
    # expected to be a full, gourmet cook. Kisses, not music, makes cooking
    # a more enjoyable experience. I told you I have a long email address.
    # I've decided what to prepare for dinner tomorrow. I hope you aren't
    # expecting anything extravagant because my culinary skills haven't
    # beenput to use in a while. My only request is that your stereo works.
    # Musicmakes cooking a more enjoyable experience. Watch the debate if
    # you are home tonight. I want a report tomorrow. . . Jen
    

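This actually follows the email specification: what separates the headers from the body is an empty line terminated by \r\n (CRLF).

The problem is that the email body really is free-form, so something fuzzy is required. Inner emails can be quoted in different ways. This is always going to be a hard thing to do.

So while exact decoding is "impossible", a good algorithm that handles the most common quoting styles could sort more than 99% of emails correctly. If some part of a message looks so much like an actual quote that it gets interpreted as one, it probably should count as one.

According to the question, however, the actual email does not contain \r\n, only \n.

Did my answer solve your question? If so, please accept it. If not, please follow up with specifics so that any outstanding issues can be resolved. Thanks.
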
    import re as regex
    import email


    def expunge_doublespaces(raw_string):
        if '  ' not in raw_string:
            return raw_string
        return expunge_doublespaces(raw_string.replace('  ', ' '))


    with open('enron_emails/sample_email.txt', 'r') as in_file:
        raw_message = in_file.read()

    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)

    # Iterate over all the parts and subparts of the message object tree
    for part in msg.walk():
        # Only the plain-text parts carry the body we want
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
            # Replace the quoted-message header (attribution line ending in a
            # time, then the To:, cc: and Subject: lines) with a space
            first_cleaning = regex.sub(
                r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))",
                r' ', email_body)
            clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
            print(clean_body)

    # print output
    # So, what is it? And by the way, don't start with the excuses.
    # You're expected to be a full, gourmet cook. Kisses, not music,
    # makes cooking a more enjoyable experience. I told you I have a
    # long email address. I've decided what to prepare for dinner
    # tomorrow. I hope you aren't expecting anything extravagant
    # because my culinary skills haven't been put to use in a while.
    # My only request is that your stereo works. Music makes cooking a
    # more enjoyable experience. Watch the debate if you are home
    # tonight. I want a report tomorrow... Jen
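
The code above joins everything into one string, whereas the question asks for a list with one string per message. A possible variation, offered only as a sketch: split the payload at the quoted-message header instead of deleting it. The names `QUOTE_HEADER`, `FOOTER` and `split_bodies` are hypothetical, the patterns cover only the quoting style and footer seen in the sample email, and the message is assumed to be single-part text/plain.

```python
import email
import re

# Assumption: a quoted message starts with a line such as
#   "Name" <addr> on 10/17/2000 04:19:20 PM
# followed by To:/cc:/Subject: lines. Other quoting styles need other patterns.
QUOTE_HEADER = re.compile(
    r"^.*\bon \d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n"
    r"(?:(?:To|cc|Subject):.*\n)+",
    re.MULTILINE,
)
# Assumption: a line of 10+ underscores starts a footer to discard.
FOOTER = re.compile(r"^_{10,}[ \t]*$", re.MULTILINE)


def split_bodies(raw_message):
    """Return one cleaned string per message found in a single-part email."""
    payload = email.message_from_string(raw_message).get_payload()
    bodies = []
    for part in QUOTE_HEADER.split(payload):
        part = FOOTER.split(part, maxsplit=1)[0]   # drop the footer, if any
        part = re.sub(r"\s+", " ", part).strip()   # collapse whitespace
        if part:
            bodies.append(part)
    return bodies


raw = (
    "From: john.arnold@enron.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "So, what is it?\n"
    "Kisses.\n"
    "\n"
    '"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n'
    "To: jarnold@enron.com\n"
    "cc:  \n"
    "Subject: Hi\n"
    "\n"
    "I told you.\n"
    "Jen\n"
    "\n"
    "____________________\n"
    "To get your own FREE ZDNet Onebox\n"
)
bodies = split_bodies(raw)
print(bodies)  # ['So, what is it? Kisses.', 'I told you. Jen']
```

As the comments above point out, this is a heuristic: a body that happens to contain a line shaped like the attribution header would be split at that point too.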