Python: getting only the body of emails from a text file

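I want to strip all the From, To, Cc, Subject, and Sent tags from this text document and keep only the body of each email, so that I can use it to summarize the document's content. What is the best way to do this in Python? I think it is best to extract the body first and then preprocess it. My code is attached here, so any suggestion on how to do this would be very helpful. The payload and is_multipart part of the code is not done correctly, and that is where my doubt lies, so I have marked that part with comments and need help there.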

The code and the .txt file are attached below for reference.

import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords

# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                    # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
            #writer.writerow([file, summary, kw])
    except Exception as e:
        pass

Text file

 Stephanie /ANN

From: Mr.A,  <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE:  Holdings: XXXX SPA – mfno.1322

Dear Dr. Tim A. , 

The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other 
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal 
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any 
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.  

Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 

Currently, there is no requirement to submit or resubmit NAs in any electronic format.  However, starting May 5, 2018, 
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common 
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal 
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A.gov/abc/bca 

This communication is an informal communication consistent with which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 

From: Tim.@xxxx.com [mailto:Tim.@xxxx.com]  
Sent: Wednesday, July 25, 2018 2:10 PM 
To: Mr.A,  <.Mr.A@abc.com> 
Cc: May.Abd@xxxx.com 
Subject: RE: Holdings: XXXX SPA ‐ dm 013383 

Dear , 

XXXX

2

Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does 
direct bANNiness for test  S intermediate with b. and not with the other companies (e, 
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to 
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a 
separate QA  S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as 
described below: 

Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced 
to our NA 13383. 

Option 2: We can do a single QA for  and mention that they can cross‐reference any of their NAs. This 
would allow them to cross‐reference any of their 

If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know. 

If not, when you issue your request, can you please send to me and May Abd by email? 

Kind regards. 

Tim 

Tim A. , BsC 
Director, YY SERVICES) 
Xxxx ANN 
Phone/FAX: 2312333 
Cell: 23312123131 
Email: tim.@xxxx.com 

From: , Tim /ANN  
Sent: Monday, July 23, 2018 7:05 AM 
To: 'Mr.A, ' 
Cc: Abd, May /ANN 
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383 

Dear , 

May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this 
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience. 

Kind regards. 

Tim 

Tim A. , MSC 
Director, PQR 
Xxxx 
Phone/FAX: 2312313313 
Cell: 3142342424 
Email: tim.@xxxx.com 

XXXX

3

‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐ 
From: "Mr.A, " <.Mr.A@abc.com> 
Date: Jul 20, 2018 9:01 AM 
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383 
To: "TRETE/ANN" <May.Abd@xxxx.com> 
Cc: "mno.com> 

Dear May Abd, 

. I need to talk to you on this.  

Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 

Currently, there is no requirement to submit or resubmit NAs in any electronic format.   
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A./cder/NA   

This communication is an informal communication  which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 

XXXX

If you only want to remove the From, Sent, To, Cc, Subject, and Forwarded tags from the emails, you can use a regex.

import re

# Compile the pattern once, outside the loop
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Forwarded message).*)', re.IGNORECASE)

with open('email_input.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        line = line.strip()
        # Print every line that matches a header pattern (the lines to remove)
        if email_component.search(line):
            print(line)

# output
# ‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
# From: Mr.A,  <.Mr.A@abc.com>
# Sent: Wednesday, July 25, 2018 2:27 PM
# To: , Tim /ANN; Abd, May /ANN
# Cc: Mr.A, ; Theoder Jerry,
# Subject: [EXTERNAL] RE:  Holdings: XXXX SPA – mfno.1322

Updated answer 1

The updated answer below cleans up some of your email input, but more cleanup is still needed.

import re

with open('email_input.txt', 'r', encoding='utf-8') as infile:
    lines = infile.readlines()

# Strip leading/trailing whitespace (including newlines) from each line
no_new_lines = [i.strip() for i in lines]

# regex to catch header lines
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Date:|Forwarded message).*)', re.IGNORECASE)
remove_headers = [x for x in no_new_lines if not email_component.search(x)]

# regex to catch greeting lines
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)
remove_greeting = [x for x in remove_headers if not greeting_component.search(x)]

# regex to catch lines with contact details
contact_component = re.compile(r'(Phone.*:)|(Cell:.*)|(Email:.*)', re.IGNORECASE)
remove_contacts = [x for x in remove_greeting if not contact_component.search(x)]

# regex to catch lines with salutation
email_salutation_component = re.compile(r'Best,(.*?)|Best regards,(.*?)|Best wishes,(.*?)|Fond regards,(.*?)|'
                                        r'Kind regards(.*?)|Regards,(.*?)|Sincerely,(.*?)|Sincerely yours,(.*?)|'
                                        r'Thank you,(.*?)|With appreciation,(.*?)|Yours sincerely,(.*?)', re.IGNORECASE)

remove_salutations = [x for x in remove_contacts if not email_salutation_component.search(x)]

# do something else
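The chain of filters above can also be folded into one reusable function. The following is only a sketch under stated assumptions: `DROP_PATTERNS` and `clean_lines` are hypothetical names, and the patterns are adapted from the regexes above rather than copied. They are anchored to the start of the line so that a body sentence that merely contains "to:" is not thrown away, and sign-offs are matched only when they make up the whole line.

```python
import re

# Hypothetical pattern list, adapted from the regexes in the answer above.
DROP_PATTERNS = [
    # Header lines such as "From: ..." or "Subject: ..."
    re.compile(r'^(From|Sent|To|Cc|Subject|Date):', re.IGNORECASE),
    # The "Forwarded message" separator line
    re.compile(r'Forwarded message', re.IGNORECASE),
    # Greeting lines
    re.compile(r'^Dear\b', re.IGNORECASE),
    # Contact details such as "Phone/FAX: ..." or "Email: ..."
    re.compile(r'^(Phone|Cell|Email)\b.*:', re.IGNORECASE),
    # Sign-offs, matched only when they are the whole line
    re.compile(
        r'^(Best|Best regards|Best wishes|Fond regards|Kind regards|Regards|'
        r'Sincerely|Sincerely yours|Thank you|With appreciation|Yours sincerely)'
        r'[,.!]?$',
        re.IGNORECASE,
    ),
]


def clean_lines(text):
    """Return the non-empty lines of text that match none of DROP_PATTERNS."""
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not any(p.search(line) for p in DROP_PATTERNS):
            kept.append(line)
    return kept
```

Calling `clean_lines(open('email_input.txt', encoding='utf-8').read())` would return the surviving lines as a list. Signature names and disclaimers are not covered by these patterns and would still need separate handling.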

import email
# Note: gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0
from gensim.summarization import summarize, keywords

with open('email_input.txt', 'r', encoding='utf-8') as infile:
    raw_message = infile.read()

# Return a message object structure from a string
msg = email.message_from_string(raw_message)

email_body = ''
# Iterate over all the parts and subparts of the message object tree
for part in msg.walk():
    # Keep the payload of the text/plain part
    if part.get_content_type() == 'text/plain':
        email_body = part.get_payload()

summary = summarize(email_body, ratio=0.10)
print(summary)

kw = keywords(email_body, words=15)
print(kw)
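The `email` parser expects a message that starts with header lines, followed by a blank line and then the body. Quoted "From:/Sent:" blocks further down a reply thread are just body text to it. The sketch below shows what the approach above does on a small, made-up message; the addresses and the `first_plain_text` helper are illustrative assumptions, not part of the original file.

```python
from email import message_from_string

# A tiny, made-up message: headers, a blank line, then the body.
raw_message = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Option 2\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "The option-2 is fine.\n"
)

msg = message_from_string(raw_message)


def first_plain_text(message):
    """Return the payload of the first text/plain part, or '' if there is none."""
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            return part.get_payload()
    return ''


body = first_plain_text(msg)
print(msg['Subject'])
print(body)
```

For a text export of a whole thread, like the file in the question, this only separates the top-level headers (if any) from everything else, so a line-based cleanup is still needed for the quoted messages.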

import re

# Matches greeting lines such as "Dear ..."
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)

with open('email_input.txt', encoding='utf-8') as infile:
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in infile:

        # Look for lines that start with a greeting
        if line.startswith("Dear"):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True

        # Look for lines that start with a salutation
        elif line.startswith("Regards") or line.startswith("Kind regards"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False

        if lines_to_keep:
            # Skip the greeting line itself
            if not greeting_component.match(line):
                print(line.rstrip('\n'))

# output
# The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
#
# more here....
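If the goal is to summarize each message in the thread, it may be more convenient to collect the captured lines instead of printing them. The following is a minimal sketch of the same greeting/sign-off state machine; `extract_bodies` and its `start`/`stops` parameters are hypothetical names, and the markers are assumed to be the ones that appear in the sample file.

```python
def extract_bodies(lines, start="Dear", stops=("Regards", "Kind regards")):
    """Collect the text between each greeting line and the next sign-off line.

    Returns one string per message body found, with its lines joined by spaces.
    """
    bodies = []
    current = []
    keeping = False

    def flush():
        # Store the lines gathered so far as one body, then reset the buffer
        if current:
            bodies.append(" ".join(current))
            current.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith(start):
            # A greeting opens a new body (and closes an unterminated one)
            flush()
            keeping = True
        elif line.startswith(stops):
            # A sign-off closes the current body
            flush()
            keeping = False
        elif keeping and line:
            current.append(line)

    flush()
    return bodies
```

With the file from the question, `extract_bodies(open('email_input.txt', encoding='utf-8'))` would give one string per reply, which can then be passed to the summarizer individually or joined together.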

import email
import glob
import os
# Note: gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0
from gensim.summarization import summarize, keywords

# Directory containing the files (taken from the question)
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'

for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = email.message_from_binary_file(file)
    main_part = None
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    if main_part is None:
        # No text/plain part was found, so there is nothing to summarize
        continue
    summary = summarize(main_part, ratio=0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)

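from email.policy import default as default_email_policy
...
    b = email.message_from_binary_file(file, policy=default_email_policy)
    # get_body() returns a message part (or None), not a string;
    # call .get_content() on the part to get the decoded text
    main_part = b.get_body(['related', 'plain', 'html'])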
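As a rough illustration of the policy-based API, the sketch below parses a small, made-up multipart message from bytes and pulls out the plain-text body. The message content and the `sep` boundary are assumptions for the example; `get_body()` and `get_content()` are only available when a modern policy such as `email.policy.default` is used.

```python
from email import message_from_bytes
from email.policy import default as default_email_policy

# A made-up multipart/alternative message with a plain-text and an HTML part.
raw = (
    b"From: sender@example.com\r\n"
    b"To: recipient@example.com\r\n"
    b"Subject: Option 2\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="sep"\r\n'
    b"\r\n"
    b"--sep\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"The option-2 is fine.\r\n"
    b"--sep\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>The option-2 is fine.</p>\r\n"
    b"--sep--\r\n"
)

msg = message_from_bytes(raw, policy=default_email_policy)

# Prefer the plain-text alternative and fall back to HTML
body_part = msg.get_body(preferencelist=('plain', 'html'))

# get_body() may return None, so guard before decoding the content
body_text = body_part.get_content() if body_part is not None else ''
print(body_text.strip())
```

The resulting `body_text` is an ordinary string, so it can be handed to the summarizer in place of `main_part`.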