How do I read an eml file in Python?


I don't know how to load an eml file in Python 3.4.
I want to list them all and read all of them in Python.


This is how you get the content of an e-mail, i.e. a *.eml file. It works perfectly on Python 2.5-2.7. Try it on 3. It should work there as well.



from email import message_from_file
import os

# Path to directory where attachments will be stored:
path = "./msgfiles"

# To have attachments extracted into memory, change behaviour of 2 following functions:

def file_exists (f):
    """Checks whether extracted file was extracted before."""
    return os.path.exists(os.path.join(path, f))

def save_file (fn, cont):
    """Saves cont to a file fn"""
    # The with statement closes the file even if write() fails:
    with open(os.path.join(path, fn), "wb") as f:
        f.write(cont)

def construct_name (id, fn):
    """Constructs a file name out of messages ID and packed file name"""
    id = id.split(".")
    id = id[0]+id[1]
    return id+"."+fn

def disqo (s):
    """Removes double or single quotations."""
    s = s.strip()
    if s.startswith("'") and s.endswith("'"): return s[1:-1]
    if s.startswith('"') and s.endswith('"'): return s[1:-1]
    return s

def disgra (s):
    """Removes < and > from HTML-like tag or e-mail address or e-mail ID."""
    s = s.strip()
    if s.startswith("<") and s.endswith(">"): return s[1:-1]
    return s

def pullout (m, key):
    """Extracts content from an e-mail message.
    This works for multipart and nested multipart messages too.
    m   -- email.Message() or mailbox.Message()
    key -- Initial message ID (some string)
    Returns tuple(Text, Html, Files, Parts)
    Text  -- All text from all parts.
    Html  -- All HTMLs from all parts
    Files -- Dictionary mapping extracted file to message ID it belongs to.
    Parts -- Number of parts in original message.
    """
    Html = ""
    Text = ""
    Files = {}
    Parts = 0
    if not m.is_multipart():
        if m.get_filename(): # It's an attachment
            fn = m.get_filename()
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, None)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
            return Text, Html, Files, 1
        # Not an attachment!
        # See where this belongs. Text, Html or some other data:
        cp = m.get_content_type()
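        # On Python 3 get_payload(decode=True) returns bytes, so decode before appending to a str:
        if cp=="text/plain": Text += m.get_payload(decode=True).decode(m.get_content_charset() or "utf-8", "replace")
        elif cp=="text/html": Html += m.get_payload(decode=True).decode(m.get_content_charset() or "utf-8", "replace")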
        else:
            # Something else!
            # Extract a message ID and a file name if there is one:
            # This is some packed file and name is contained in content-type header
            # instead of content-disposition header explicitly
            cp = m.get("content-type")
            try: id = disgra(m.get("content-id"))
            except: id = None
            # Find file name:
            o = cp.find("name=")
            if o==-1: return Text, Html, Files, 1
            ox = cp.find(";", o)
            if ox==-1: ox = None
            o += 5; fn = cp[o:ox]
            fn = disqo(fn)
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, id)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
        return Text, Html, Files, 1
    # This IS a multipart message.
    # So, we iterate over it and call pullout() recursively for each part.
    y = 0
    while 1:
        # If we cannot get the payload, it means we hit the end:
        try:
            pl = m.get_payload(y)
        except IndexError: break
        # pl is a new Message object which goes back to pullout
        t, h, f, p = pullout(pl, key)
        Text += t; Html += h; Files.update(f); Parts += p
        y += 1
    return Text, Html, Files, Parts

def extract (msgfile, key):
    """Extracts all data from e-mail, including From, To, etc., and returns it as a dictionary.
    msgfile -- A file-like readable object
    key     -- Some ID string for that particular Message. Can be a file name or anything.
    Returns dict()
    Keys: from, to, subject, date, text, html, parts[, files]
    Key files will be present only when message contained binary files.
    For more see __doc__ for pullout() and caption() functions.
    """
    m = message_from_file(msgfile)
    From, To, Subject, Date = caption(m)
    Text, Html, Files, Parts = pullout(m, key)
    Text = Text.strip(); Html = Html.strip()
    msg = {"subject": Subject, "from": From, "to": To, "date": Date,
        "text": Text, "html": Html, "parts": Parts}
    if Files: msg["files"] = Files
    return msg

def caption (origin):
    """Extracts: To, From, Subject and Date from email.Message() or mailbox.Message()
    origin -- Message() object
    Returns tuple(From, To, Subject, Date)
    If message doesn't contain one/more of them, the empty strings will be returned.
    """
    # Message.has_key() no longer exists on Python 3; the "in" operator works on both 2 and 3.
    Date = ""
    if "date" in origin: Date = origin["date"].strip()
    From = ""
    if "from" in origin: From = origin["from"].strip()
    To = ""
    if "to" in origin: To = origin["to"].strip()
    Subject = ""
    if "subject" in origin: Subject = origin["subject"].strip()
    return From, To, Subject, Date

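I wrote this for my mail group using mailbox, which is why it is so convoluted. It has never failed me. Never any junk. If the message is multipart, the output dictionary will contain a key "files" (a sub-dictionary) holding the file names of all other extracted files that were not text or HTML. That was my way of extracting attachments and other binary data. You can change this in pullout(), or just change the behaviour of file_exists() and save_file().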
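As a rough sketch of that last point, and not part of the original code, the two helpers could keep attachments in memory instead of on disk. The `extracted` dictionary is a name made up for this illustration; pullout() would use these functions unchanged.

```python
# Hypothetical in-memory replacements for file_exists() and save_file().
# `extracted` is an illustrative name, not something the original answer defines.
extracted = {}

def file_exists(f):
    """Return True if a file with this name has already been stored."""
    return f in extracted

def save_file(fn, cont):
    """Keep the decoded attachment bytes in a dict instead of writing them to disk."""
    extracted[fn] = cont
```

After extract() has run, `extracted` would map each constructed file name to its raw bytes.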

construct_name() builds a file name out of the message ID and the multipart message's file name, if there is one.

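In pullout(), the Text and Html variables are strings. For an online mail group it was fine to collect, all at once, any text or HTML packed into a multipart message that wasn't an attachment.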

If you need something more sophisticated, change Text and Html to lists, append to them, and join them as needed. Nothing problematic.
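One way the list-based variant might look, as a minimal sketch rather than a drop-in replacement for pullout(): the function name `collect_text_parts` is invented here, and falling back to UTF-8 when a part declares no charset is an assumption.

```python
def collect_text_parts(m):
    """Return (texts, htmls): lists holding one decoded string per text part."""
    texts, htmls = [], []
    for part in m.walk():
        # Skip multipart containers and anything that carries a file name.
        if part.is_multipart() or part.get_filename():
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = part.get_content_charset() or "utf-8"
        body = payload.decode(charset, "replace")
        if part.get_content_type() == "text/plain":
            texts.append(body)
        elif part.get_content_type() == "text/html":
            htmls.append(body)
    return texts, htmls
```

Keeping the parts separate lets the caller decide later whether to join them, show only the first one, or prefer plain text over HTML.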

There may be some errors here, because this was written to work with mailbox.Message(), not email.Message(). I tried it on email.Message() and it worked fine.

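You said you "wish to list them all". From where? If you mean a POP3 mailbox or the mailbox of some nice open-source mailer, you can do it with the mailbox module. If you want to list them from other programs, you have a problem. For example, to get mail out of MS Outlook you have to know how to read OLE2 compound files. Other mailers rarely call them *.eml files, so I think this is exactly what you want to do. In that case, search PyPI for the olefile or compoundfiles module and Google how to extract an e-mail from an MS Outlook inbox file. Or save yourself the mess and just export them from there to some directory. Once you have them as eml files, apply this code.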
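For the mailbox route, a minimal sketch could look like the following. It assumes the mailer stores its mail in mbox format (the standard mailbox module also has classes for Maildir, MH and others), and `list_subjects` is a name invented for this example.

```python
import mailbox

def list_subjects(mbox_path):
    """Return the Subject header of every message in an mbox file."""
    # create=False: raise an error instead of silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg["subject"] or "" for msg in box]
    finally:
        box.close()
```

Each item yielded by the mailbox is a message object, so it could also be handed to pullout() and caption() from the code above.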

I found this to be much simpler:

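import email
import os

path = './'
listing = os.listdir(path)

for fle in listing:
    if fle.lower().endswith(".eml"):
        # os.listdir() returns bare names, so join them with the directory:
        with open(os.path.join(path, fle), 'rb') as fp:
            msg = email.message_from_binary_file(fp)
        # walk() also copes with non-multipart and nested messages
        for attachment in msg.walk():
            fnam = attachment.get_filename()
            data = attachment.get_payload(decode=True)
            if not fnam or data is None:
                continue  # not a part that can be saved as a file
            # basename() keeps the file inside the current directory
            with open(os.path.basename(fnam), 'wb') as f:
                f.write(data)
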
Try this:

#!python3
# -*- coding: utf-8 -*-

import email
import os

SOURCE_DIR = 'email'
DEST_DIR = 'temp'

def extractattachements(fle,suffix=None):
    # Open in binary mode and let the parser deal with the encoding:
    with open(fle, 'rb') as fp:
        message = email.message_from_binary_file(fp)
    filenames = []
    if message.get_content_maintype() == 'multipart':
        for part in message.walk():
            if part.get_content_maintype() == 'multipart': continue
            #if part.get('Content-Disposition') is None: continue
            # Note: only application/octet-stream parts are extracted here.
            if part.get_content_type() != 'application/octet-stream': continue
            filename = part.get_filename()
            if not filename: continue  # no name to save the part under
            if suffix:
                # splitext() also copes with names that have several dots or none
                base, ext = os.path.splitext(filename)
                filename = ''.join([base, '_', suffix, ext])
            filename = os.path.join(DEST_DIR, filename)
            with open(filename,'wb') as fb:
                fb.write(part.get_payload(decode=True))
            filenames.append(filename)
    return filenames

def main():
    os.makedirs(DEST_DIR, exist_ok=True)  # make sure the output directory exists
    onlyfiles = [f for f in os.listdir(SOURCE_DIR) if os.path.isfile(os.path.join(SOURCE_DIR, f))]
    for file in onlyfiles:
        extractattachements(os.path.join(SOURCE_DIR,file))
    return True

if __name__ == "__main__":
    main()

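Posting this here for anyone who just wants to extract the text from an email and get a list of .eml files. It took me forever to find a good answer online. NOTE: this will not get the attachments, only the text of the email.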

import email
from email import policy
from email.parser import BytesParser
import glob
import os

path = '/path/to/data/' # set this to "./" if in current directory

eml_files = glob.glob(path + '*.eml') # get all .eml files in a list
for eml_file in eml_files:
    with open(eml_file, 'rb') as fp:  # select a specific email file from the list
        name = fp.name # Get file name
        msg = BytesParser(policy=policy.default).parse(fp)
    # preferencelist should be a tuple, hence the trailing comma
    body = msg.get_body(preferencelist=('plain',))
    # get_body() returns None when the message has no text/plain part
    text = body.get_content() if body is not None else ''

    text = text.split("\n")
    print (name) # Get name of eml file
    print (text) # Get list of all text in email

Credit for some of this code goes to another post.
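The same parser can also hand back the headers alongside the body. The sketch below is only an illustration: `summarize_eml` is a made-up name, and falling back to the HTML part when there is no plain-text part is an assumption about what a reader would want.

```python
from email import policy
from email.parser import BytesParser

def summarize_eml(raw_bytes):
    """Parse raw .eml bytes and return a few headers plus the preferred text body."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # Prefer text/plain, fall back to text/html (an assumption, adjust as needed).
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["subject"] or ""),
        "from": str(msg["from"] or ""),
        "to": str(msg["to"] or ""),
        "body": body.get_content() if body is not None else "",
    }
```

To use it on a file, read the file in binary mode and pass the bytes in, for example `summarize_eml(open(eml_file, 'rb').read())`.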

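Hello, and welcome to StackOverflow. Please take some time to read the help page.

@Dalen Since when do we judge questions by how closely they approximate the worst questions on the site?

Please elaborate on this answer instead of simply linking off-site. Your link may break, which would make this answer useless.

@Two-Bit Alchemist Since never. I just felt a bit of sympathy, so I downvoted.

@Two-Bit Alchemist Since the link is up, I hope it won't break. I extracted that piece of code in a hurry. I'll copy it here and comment it properly soon, but let b.enoit.be try it first :D

@Two-Bit Alchemist I elaborated! Are you happy now?

@Dalen, when calling message_from_file(), I pass the path to the .eml file as the argument.

Simpler, certainly, but also more limited. I explained why mine is so convoluted. People will use whichever method they find better suited to their needs. You get +1 from me for providing another solution.

I get a FileNotFoundError and I don't know why, since I took the file name from the listing.

I like your answer, but note that attachments can be of any type (image/jpeg, audio/mpeg, application/msword, ...), not just application/octet-stream. A better indicator of whether a part is an actual attachment or part of the e-mail itself is the Content-Disposition header, which you checked and then commented out. Why? If there is a specific purpose, please add a comment explaining it.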
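The last comment's suggestion, selecting attachments by their Content-Disposition header rather than by content type, could be sketched like this. `list_attachments` is a hypothetical helper, not code from any of the answers, and it relies on `get_content_disposition()`, which needs Python 3.5 or later.

```python
from email import policy
from email.parser import BytesParser

def list_attachments(raw_bytes):
    """Return (filename, payload_bytes) for every part marked as an attachment."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    found = []
    for part in msg.walk():
        # Only parts whose Content-Disposition is "attachment" are collected;
        # inline parts and the message body are skipped.
        if part.get_content_disposition() != "attachment":
            continue
        found.append((part.get_filename(), part.get_payload(decode=True)))
    return found
```

This picks up image/jpeg, application/msword and any other type as long as the sender marked the part as an attachment; inline images are deliberately left out.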