Parsing multipart emails with subparts in Python


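I am using this function to parse emails. I can parse "simple" multipart emails, but when an email defines multiple boundaries (subparts), it raises an error (UnboundLocalError: local variable 'html' referenced before assignment). I want the script to separate the text and HTML parts and return only the HTML part (or the text part if there is no HTML part).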

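As the comments say, you always check html, but you only assign it in one specific case. That is what the error is telling you: you reference html before it has been assigned. In Python, checking whether a name is None is invalid if nothing has ever been assigned to that name. For example, open a Python interactive prompt: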

>>> if y is None:
...   print 'none'
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'y' is not defined
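
Inside a function the same mistake shows up as UnboundLocalError instead of NameError. Here is a minimal sketch for Python 3. The function names and the (content_type, body) tuples are made up for illustration and are not part of the original code:

```python
def pick_body_buggy(parts):
    """`html` is bound only if a text/html part happens to be present."""
    text = ""
    for content_type, body in parts:
        if content_type == "text/plain":
            text = body
        elif content_type == "text/html":
            html = body
    # With no text/html part, `html` was never bound, so this line raises
    # UnboundLocalError before the comparison with None can happen.
    return text if html is None else html


def pick_body_fixed(parts):
    """Same logic, but `html` is bound to None before the loop."""
    text = ""
    html = None
    for content_type, body in parts:
        if content_type == "text/plain":
            text = body
        elif content_type == "text/html":
            html = body
    return text if html is None else html


plain_only = [("text/plain", "hello")]
try:
    pick_body_buggy(plain_only)
except UnboundLocalError as exc:
    print("buggy version failed:", exc)
print(pick_body_fixed(plain_only))
```

The only difference between the two functions is the `html = None` line, which is the fix the answer describes.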

This explains it a bit further:

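Below is the same code with OlliM's helpful suggestion applied. Without this change, you cannot correctly parse "multipart/alternative" containers in an email.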

import chardet

def get_text(msg):
    """ Parses email message text, given message object
    This doesn't support infinite recursive parts, but mail is usually not so naughty.
    """
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'multipart/alternative':
                for subpart in part.get_payload():
                    if subpart.get_content_charset() is None:
                        charset = chardet.detect(str(subpart))['encoding']
                    else:
                        charset = subpart.get_content_charset()
                    if subpart.get_content_type() == 'text/plain':
                        text = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
                    if subpart.get_content_type() == 'text/html':
                        html = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')

        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

Writing a more elegant structure that does not repeat any code is left as an exercise for the reader.
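
One possible take on that exercise, as a sketch for Python 3 rather than a drop-in replacement: Message.walk() from the standard email package visits the message and every nested subpart, so the duplicated inner loop goes away. The names get_body and build_sample are hypothetical. Falling back to UTF-8 when no charset is declared and skipping attachments are both assumptions; the code above guesses the charset with chardet instead.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def get_body(msg):
    """Return the HTML body if a text/html part exists, else the plain text."""
    text, html = "", None
    # walk() yields the message itself and all nested subparts, depth first,
    # so multipart/alternative inside multipart/mixed needs no special case.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        if part.get_content_disposition() == "attachment":
            continue  # assumption: attached files are not the message body
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        # Assumption: use UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, "replace")
        except LookupError:  # the declared charset name is unknown to Python
            body = payload.decode("utf-8", "replace")
        if content_type == "text/html":
            html = body
        else:
            text = body
    return (text if html is None else html).strip()


def build_sample():
    """Build a multipart/mixed message wrapping a multipart/alternative part."""
    outer = MIMEMultipart("mixed")
    alternative = MIMEMultipart("alternative")
    alternative.attach(MIMEText("plain body", "plain"))
    alternative.attach(MIMEText("<p>html body</p>", "html"))
    outer.attach(alternative)
    return outer


print(get_body(build_sample()))
```

As in the code above, the last matching part wins if a message contains several text/plain or text/html parts.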

Also, take a look at this.

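I needed the following changes in my code: unicode became str, and str(part) became bytes(part) in charset = chardet.detect(str(part))['encoding']. I applied bs4 to the HTML. The code was useful for my project. Thanks.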
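
For reference, the Python 3 changes described in that comment might look roughly like this for a single part. This is a sketch, not the commenter's actual code: decode_part is a hypothetical helper, and the UTF-8 fallback when chardet is not installed or finds nothing is an assumption. The bs4 step is not shown.

```python
from email.mime.text import MIMEText

try:
    import chardet  # third-party; only consulted when a part declares no charset
except ImportError:  # assumption: carry on with a UTF-8 fallback
    chardet = None


def decode_part(part):
    """Decode one non-multipart part to a Python 3 str (hypothetical helper)."""
    charset = part.get_content_charset()
    if charset is None and chardet is not None:
        # bytes(part) takes the place of the Python 2 str(part)
        charset = chardet.detect(bytes(part))["encoding"]
    payload = part.get_payload(decode=True) or b""
    # str(bytes, encoding, errors) takes the place of unicode(...); the result
    # is already text, so the trailing .encode('utf8', 'replace') is dropped.
    try:
        return str(payload, charset or "utf-8", "ignore")
    except LookupError:  # charset name not known to Python
        return str(payload, "utf-8", "ignore")


sample = MIMEText("héllo wörld", "plain", "utf-8")
decoded = decode_part(sample)
print(decoded)
```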

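You seem to be leaving out some useful information: what is the error message, and which mail-handling library are you using? Also, you always check a variable named html, but you only declare it when there is a text/html part.

The mail-handling library is "email". The html variable is only used in this function. Any idea why the error happens?

This code is useful. The only change I made was to support parts of type "multipart/alternative": you have to run the same loop over the subparts of that part.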

import chardet

def get_text(msg):
    # Only top-level parts are inspected; parts nested inside a
    # multipart/alternative container are not visited.
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()
