Python: How to parse the body from a raw email, given that raw email does not have a "Body" tag or anything

Tags: python, email, python-2.7, mod-wsgi, wsgi

It seems easy to get the From, To, Subject, etc. via

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']
assuming that "a" is the raw email string, which looks something like this:

a = """From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: oooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""
Is this the right way?

Or is there something simpler, such as

import email
b = email.message_from_string(a)
bbb = b['body']
?


There is no b['body'] in Python. You have to use get_payload:

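# mailEntity is assumed to be a parsed message, e.g. email.message_from_string(a)
payload = mailEntity.get_payload()
if isinstance(payload, list):
    # multipart message: each item in the list is itself a message part
    for eachPayload in payload:
        # do the things you want here;
        # the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # there is only a single part (e.g. text/plain),
    # so mailEntity.get_payload() is the body itself
    print(payload)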

Good luck.
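As a minimal sketch of the point above, assuming Python 3: get_payload() returns a list of parts for a multipart message and a plain string otherwise. The two sample messages (raw_multipart, raw_simple) and the helper payload_texts are made up for illustration and are not part of the question.

```python
import email

# Hypothetical sample messages, made up for illustration only.
raw_multipart = (
    "From: root@a1.local.tld\n"
    "To: ooo@a1.local.tld\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="bound1"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "\n"
    "--bound1\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello from part one\n"
    "--bound1--\n"
)
raw_simple = "From: root@a1.local.tld\nSubject: demo\n\njust text\n"


def payload_texts(raw):
    """Return the payload of each top-level part as a list."""
    msg = email.message_from_string(raw)
    payload = msg.get_payload()
    if isinstance(payload, list):
        # multipart: every item is itself a Message object
        return [part.get_payload() for part in payload]
    # not multipart: the payload is the body text itself
    return [payload]


multipart_texts = payload_texts(raw_multipart)
simple_texts = payload_texts(raw_simple)
print(multipart_texts)
print(simple_texts)
```

This only looks one level deep: a nested multipart part would itself return a list from get_payload().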

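To be highly confident that you are working with the actual email body (while still accepting the possibility that you are not parsing the right part), you have to skip attachments and focus on the plain or html part (depending on your needs) for further processing.

Since the attachments mentioned above can be, and very often are, text/plain or text/html parts themselves, this non-bullet-proof sample skips them by checking the Content-Disposition header: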

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)
By the way, walk() iterates marvelously over the MIME parts, and get_payload(decode=True) does the dirty work of decoding base64 etc. for you.
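To illustrate what decode=True does, here is a small sketch assuming Python 3 and a made-up message whose text/plain part is base64-encoded UTF-8. Note that in Python 3 get_payload(decode=True) returns bytes, so the sketch also decodes them with the part's declared charset; the fallback to "utf-8" is an assumption of this example.

```python
import base64
import email

# Hypothetical message whose text/plain part is base64-encoded UTF-8.
encoded = base64.b64encode("héllo wörld".encode("utf-8")).decode("ascii")
raw = "\n".join([
    "From: root@a1.local.tld",
    "Subject: demo",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="bound1"',
    "",
    "--bound1",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    encoded,
    "--bound1--",
    "",
])

msg = email.message_from_string(raw)

undecoded = None
text = None
for part in msg.walk():
    if part.get_content_type() == "text/plain":
        undecoded = part.get_payload()             # still the base64 text
        raw_bytes = part.get_payload(decode=True)  # bytes, base64 undone
        # Fall back to UTF-8 if the part declares no charset (assumption).
        charset = part.get_content_charset() or "utf-8"
        text = raw_bytes.decode(charset, errors="replace")
        break

print(undecoded)
print(text)
```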

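Some background: as I implied, the wonderful world of MIME emails presents a lot of pitfalls for finding the message body "wrongly". In the simplest case it is in the sole "text/plain" part, and get_payload() is very tempting, but we don't live in a simple world: the body is often wrapped in multipart/alternative, related, mixed, etc. content. Wikipedia describes it tightly, but considering that all of the cases below are valid, and common, one has to build in safety nets:

Very common, and pretty much what you get from normal editors (Gmail, Outlook) when sending formatted text with an attachment: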

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel
Relatively simple, just alternative representations:

multipart/alternative
 |
 +- text/plain
 +- text/html
For better or worse, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg
Hope this helps a bit.

P.S. My point is: don't approach email lightly. It bites when you least expect it :)
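When it is unclear which of these structures a message has, it can help to print its MIME tree before deciding where the body lives. The following is a rough sketch, assuming Python 3; the helper mime_tree and the message built here are hypothetical and exist only to have a nested structure to inspect.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def mime_tree(part, depth=0):
    """Return one indented line per MIME part, depth-first."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        # for a multipart container, get_payload() is the list of children
        for child in part.get_payload():
            lines.extend(mime_tree(child, depth + 1))
    return lines


# Hypothetical message, built only to have a nested structure to inspect.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("plain body", "plain"))
alternative.attach(MIMEText("<p>html body</p>", "html"))

outer = MIMEMultipart("mixed")
outer.attach(alternative)
outer.attach(MIMEApplication(b"fake spreadsheet bytes", "vnd.ms-excel"))

tree = mime_tree(outer)
print("\n".join(tree))
```

The same helper works on a message parsed with email.message_from_string, since parsed and hand-built messages share the is_multipart() and get_payload() interface.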

There is a very good package available for parsing email contents, with proper documentation.

import mailparser

mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)
How to use:

mail.attachments: list of all attachments
mail.body
mail.to

If emails is a pandas DataFrame and emails.message is its column of raw email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs 

import email
# Parse the emails into a list email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages

emails.head()

Here is the code that works for me every time (for Outlook emails); it is the win32com.client listing further down this page.


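Python 3.6+ provides built-in convenience methods to find and decode the plain-text body, as in @Todor Minakov's answer. You can use the get_body() and get_content() methods, as in the msg.get_body(('plain',)) listing further down this page.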

Note: this will give None if there is no (obvious) plain-text body part.
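As a hedged, self-contained sketch of both outcomes, assuming Python 3.6+: the helper plain_body and the two sample messages are made up for illustration. One message has a text/plain alternative, the other is HTML only, so get_body(preferencelist=("plain",)) returns None for it.

```python
import email
import email.policy

# Hypothetical sample messages, made up for illustration only.
raw_with_text = (
    "From: root@a1.local.tld\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="bound1"\n'
    "\n"
    "--bound1\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "plain version\n"
    "--bound1\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<p>html version</p>\n"
    "--bound1--\n"
)
raw_html_only = (
    "From: root@a1.local.tld\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<p>html only</p>\n"
)


def plain_body(raw):
    """Return the decoded text/plain body, or None if there is none."""
    msg = email.message_from_string(raw, policy=email.policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return None
    return part.get_content()


text = plain_body(raw_with_text)
missing = plain_body(raw_html_only)
print(repr(text))
print(missing)
```

The check is written as "is None" rather than a plain truth test because a message object's truthiness depends on its header count, not on whether a part was found.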

If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage factory:

import email.policy
import mailbox

mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
    ...

Note: you must pass email.policy.default as the policy, since it is not the default...
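A short sketch of what that note means in practice, assuming Python 3.6+ and a made-up one-line message: without an explicit policy the parser uses the legacy compat32 policy and returns a plain Message object, which has no get_body(); with email.policy.default it returns an EmailMessage.

```python
import email
import email.policy

# Hypothetical minimal message, made up for illustration only.
raw = "From: root@a1.local.tld\nSubject: demo\n\nhello\n"

legacy = email.message_from_string(raw)  # no policy given: compat32
modern = email.message_from_string(raw, policy=email.policy.default)

legacy_type = type(legacy).__name__
modern_type = type(modern).__name__
legacy_has_get_body = hasattr(legacy, "get_body")
modern_has_get_body = hasattr(modern, "get_body")

print(legacy_type, legacy_has_get_body)
print(modern_type, modern_has_get_body)
```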

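Thank you for this complete example, and for spelling out a caveat that the accepted answer leaves out. I think this is the better, safer approach.

Ah, very nice! Using .get_payload(decode=True) instead of just .get_payload() has made life much easier, thanks!

I am only looking for the body from .get_payload(decode=True). Is there any way to do that?

The library is great, but I had to create my own class that inherits from MailParser and overrides the body method, because it joins the parts of the email body with "\n--mail_boundary--\n", which was not ideal for me.

Hi @avram, could you share the class you wrote?

I managed to split the result on "\n--mail_boundary--\n".

@AmeyPNaik Here I made a quick GitHub gist:

@AmeyPNaik Their documentation says: mail-parser can parse the Outlook email format (.msg). To use this feature, you need to install the libemail-outlook-message-perl package.

It probably needs to be stated that this is for Outlook on Windows, not for real email.

Why is email.policy.default not the default? It seems like it should be.

@PartialOrder Backwards compatibility. It will become the default, and you should already be using it now.

This is very important information, and encouraging, but it confused me for a while: the lambda did not immediately reveal that the import of email.policy was missing.

I guess the factory is not consulted if you access messages explicitly, e.g. via mbox.get_message(0).

People can also note that Python 3.6+ has the convenient get_body() function (or the more explicit make_EmailMessage factory-function approach), available through the upcoming default parsing policy, as described in @Doctor J's updated answer. Note too that Todor Minakov's answer is more robust than falsetru's; the other answers do a better job both of being robust and of taking advantage of the newer get_body() functionality.

@nealmcb When I answered there was no get_body; it seems to have appeared in Python 3.6. By the way, the question is tagged python-2.7, where you cannot use get_body.

Good point! Of course, even though the question is about Python 2, we can take more interest in the modern solutions. But also note that, as Todor describes, a lot of email has a complicated structure, so a more general approach is a good idea, and your "..." is not very specific.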
# Read the subject and body of unread emails in an Outlook folder (or subfolder).
# Requires Outlook on Windows and the pywin32 package.
import win32com.client

# Create the MAPI namespace object
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Get to the desired folder (MyEmail@xyz.com is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['MyEmail@xyz.com'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:
    if message.Unread:                      # only 'Unread' emails
        subject_content = message.Subject   # subject line of the mail
        body_content = message.Body         # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False              # mark the mail as 'Read'
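# Python 3.6+: find and decode the plain-text body (s is the raw email string)
import email
import email.policy

msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body is not None:
    body = body.get_content()
print(body)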