Parsing an email body with Python

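I am working with the Enron dataset and would like to extract the clean email bodies into a list, with each reply saved as one string in the list. For example, for the following email: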

Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it?   And by the way, don't start with the excuses.   You're 
expected to be a full, gourmet cook.

Kisses, not music, makes cooking a more enjoyable experience.  

"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi

I told you I have a long email address.

I've decided what to prepare for dinner tomorrow.  I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience.

Watch the debate if you are home tonight.  I want a report tomorrow...
Jen

___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com

I would like a list whose first element is:

"So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience."

Is there a library that can do this? The full list I am after looks like this:

["So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience.", 
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow.  I hope you aren't 
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight.  I want a report tomorrow...
Jen"]

I tried Python's email library, but it doesn't seem to have this functionality: I get the entire body back, quoted message included:

import email
message = data_  # data_ is the raw text of the email
e = email.message_from_string(message)
print(e.get_payload())

Output:

'So, what is it?   And by the way, don\'t start with the excuses.   You\'re \nexpected to be a full, gourmet cook.\n\nKisses, not music, makes cooking a more enjoyable experience.  \n\n\n\n\n"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\nTo: jarnold@enron.com\ncc:  \nSubject: Hi\n\n\nI told you I have a long email address.\n\nI\'ve decided what to prepare for dinner tomorrow.  I hope you aren\'t\nexpecting anything extravagant because my culinary skills haven\'t been\nput to use in a while.  My only request is that your stereo works.  Music\nmakes cooking a more enjoyable experience.\n\nWatch the debate if you are home tonight.  I want a report tomorrow...\nJen\n\n___________________________________________________________________\nTo get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,\nall in one place - sign up today at http://www.zdnetonebox.com\n\n\n'

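Sorry to say, but this email format can't be decoded reliably, because there is no way to distinguish the header of the quoted email

"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi

from the actual body text. The body of an email could contain such a string for some legitimate reason, so how would you tell which part is real body and which part is header?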

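As I see it, you have a few options here:

1. Use your existing function, but add a second function that does some text processing, for example

text = re.sub(r'[\s]+', ' ', text)

to remove the occurrences of \n, and then presumably go on to fix all the cases of \' and to drop everything from the divider line down. This seems like the simplest solution, but it has limitations; all of the ones visible in your example can be handled with some regex/grep/awk trickery.

2. Use another library (as you asked). I know of [link missing]: you can probably guess what it does from its name, but it also parses the emails. Again, some post-processing may be in order, but it looks like a combination of the headers (e.g. the date) and the body should get you most of what you need.

3. Use a web service, such as [link missing].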
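The first option could be sketched roughly as follows. This is only an illustration: the helper names `drop_footer` and `collapse_whitespace` are made up here, and the divider pattern assumes that a footer always starts with a line of ten or more underscores, as in the sample email.

```python
import re

# Assumption: footers start with a line made only of 10+ underscores,
# as in the sample email. Other datasets may need a different pattern.
DIVIDER = re.compile(r"^_{10,}\s*$", re.MULTILINE)


def drop_footer(text):
    """Discard the divider line and everything after it, if present."""
    match = DIVIDER.search(text)
    return text[:match.start()] if match else text


def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


payload = (
    "Watch the debate if you are home tonight.  I want a report tomorrow...\n"
    "Jen\n\n"
    "____________________\n"
    "To get your own FREE ZDNet Onebox\n"
)
print(collapse_whitespace(drop_footer(payload)))
# Watch the debate if you are home tonight. I want a report tomorrow... Jen
```

Dropping the footer before collapsing whitespace matters, because the divider pattern relies on line boundaries that are gone once the newlines have been replaced.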

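I'm going to assume that you have all of the Enron email messages in a .csv file, which is a common format for this dataset. While working on this message I noticed a few data-cleaning issues, mostly around the '\n' characters in the message. I am still working on a way to fix that small problem.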

import re as regex


def expunge_doublespaces(raw_string):
    # Recursively collapse runs of spaces into single spaces.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))


def parse_raw_email_message(raw_message):
    lines = raw_message.splitlines()
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text. They are
            # concatenated without a separator, which is why words at line
            # breaks run together in the output ("beenput", "Musicmakes").
            message += line
            email['body'] = message
        else:
            # Lines with a colon are treated as "key: value" headers.
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email


###############################################
# change this open section to fit your dataset
###############################################
with open('enron_emails/sample_email.txt', 'r') as in_file:
    parsed_email = parse_raw_email_message(in_file.read())
    for key, value in parsed_email.items():
        if key == "body":
            # Insert a space after 't, a period, or a comma whenever the
            # next character is not whitespace.
            first_cleaning = regex.sub(r"(?<=('t)(?=[^\s]))|(?<=[.,])(?=[^\s])", r' ', value)
            cleaned_body = expunge_doublespaces(first_cleaning)
            print(cleaned_body)

Printed output:

So, what is it? And by the way, don't start with the excuses. You're
expected to be a full, gourmet cook. Kisses, not music, makes cooking
a more enjoyable experience. I told you I have a long email address.
I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't
beenput to use in a while. My only request is that your stereo works.
Musicmakes cooking a more enjoyable experience. Watch the debate if
you are home tonight. I want a report tomorrow. . . Jen
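The answer assumes a .csv file, while the code above reads a single text file. A minimal sketch of how the raw messages might be pulled out of such a CSV with the standard library follows. The column names `file` and `message` are assumptions about the CSV layout, and `iter_raw_messages` is a hypothetical helper; check the header row of your own copy of the dataset.

```python
import csv
import io


def iter_raw_messages(csv_file, column="message"):
    """Yield the raw text of each email from an open CSV file.

    `column` is the name of the field holding the full message;
    "message" is an assumption, not something the dataset guarantees.
    """
    for row in csv.DictReader(csv_file):
        yield row[column]


# Tiny in-memory stand-in for the real file.
sample_csv = io.StringIO(
    "file,message\n"
    '"arnold-j/sent/1.","From: a@enron.com\nTo: b@enron.com\n\nHello\nworld"\n'
)

for raw in iter_raw_messages(sample_csv):
    print(repr(raw))
```

With a real file, open it with `newline=""` (for example `open(path, newline="")`) so that the csv module handles the line breaks embedded in quoted fields itself. Each yielded string can then be passed to a parsing function such as `parse_raw_email_message`.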

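This actually follows the email specification: what separates the headers from the body is an empty line ending in \r\n (CRLF). The problem is that the body of an email is really free-form, so you need to do something fuzzy. Inner emails can be quoted in many different ways, and this will always be a hard thing to do. So while exact decoding is "impossible", a good algorithm that handles the most common quoting styles could sort more than 99% of emails correctly. If some part of a message resembles an actual quote so closely that it is interpreted as one, it probably should count as a quote.

However, according to the question, the actual email doesn't contain \r\n, only \n.

Did my answer solve your problem? If so, please accept it. If not, please follow up with specifics so that any outstanding issues can be resolved. Thanks.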
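A heuristic of the kind described in these comments might look like the sketch below. It lets the `email` module split the headers from the body at the first empty line, then splits the body on one assumed quoting style, namely the `"Name" <address> on <date> <time> AM|PM` line followed by `To:`, `cc:` and `Subject:` lines seen in the sample email. The function name `split_replies` and both patterns are illustrative assumptions. Other quoting styles would need their own patterns, and the sketch assumes a non-multipart text/plain message.

```python
import email
import re

# Assumed quoting style, taken from the sample email only:
#   "Name" <address> on MM/DD/YYYY HH:MM:SS AM|PM
#   To: ...
#   cc: ...
#   Subject: ...
QUOTE_HEADER = re.compile(
    r"^.*\bon \d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n"
    r"To:.*\n"
    r"cc:.*\n"
    r"Subject:.*\n",
    re.MULTILINE,
)
# Assumed footer marker: a line made only of 10+ underscores.
FOOTER = re.compile(r"^_{10,}[ \t]*$", re.MULTILINE)


def split_replies(raw_message):
    """Return the cleaned text of each reply, newest first."""
    # message_from_string separates headers and body at the first empty line.
    body = email.message_from_string(raw_message).get_payload()
    replies = []
    for chunk in QUOTE_HEADER.split(body):
        footer = FOOTER.search(chunk)
        if footer:
            chunk = chunk[:footer.start()]
        chunk = " ".join(chunk.split())  # collapse all whitespace runs
        if chunk:
            replies.append(chunk)
    return replies


raw = (
    "From: john.arnold@enron.com\n"
    "To: jenwhite7@zdnetonebox.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "So, what is it?\n"
    "\n"
    '"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n'
    "To: jarnold@enron.com\n"
    "cc:  \n"
    "Subject: Hi\n"
    "\n"
    "I told you I have a long email address.\n"
    "Jen\n"
    "\n"
    "____________________\n"
    "To get your own FREE ZDNet Onebox\n"
)
print(split_replies(raw))
# ['So, what is it?', 'I told you I have a long email address. Jen']
```

Because the pattern is anchored on a full timestamp plus three header-like lines in a row, ordinary body text is unlikely to match it by accident, although, as noted in the comments, no such heuristic can be exact.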

import re as regex
import email


def expunge_doublespaces(raw_string):
    # Recursively collapse runs of spaces into single spaces.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))


with open('enron_emails/sample_email.txt', 'r') as in_file:
    email_body = ''
    raw_message = in_file.read()

    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)

    # Iterate over all the parts and subparts of the message object tree
    for part in msg.walk():
        # Only process parts whose content type is plain text
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
            # Replace the quoted reply header (the sender/timestamp line
            # followed by the To:, cc: and Subject: lines) with a space
            first_cleaning = regex.sub(r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))", r' ',
                                       email_body)
            clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
            print(clean_body)

Printed output:

So, what is it? And by the way, don't start with the excuses.
You're expected to be a full, gourmet cook. Kisses, not music,
makes cooking a more enjoyable experience. I told you I have a
long email address. I've decided what to prepare for dinner
tomorrow. I hope you aren't expecting anything extravagant
because my culinary skills haven't been put to use in a while.
My only request is that your stereo works. Music makes cooking a
more enjoyable experience. Watch the debate if you are home
tonight. I want a report tomorrow... Jen