Python: How to remove "Forwarded message" headers and unwanted content from Enron email bodies?

Tags: python, email, nltk, python-textprocessing

I am trying to append the bodies of all the Enron emails to one file so that I can process their text by removing stop words and splitting them into sentences with NLTK. My problem is the forwarded and replied messages, and I don't know how to clean them out. This is my code so far:

    import os, email, sys, re, nltk, pprint
    from email.parser import Parser

    rootdir = '/Users/art/Desktop/maildir/lay-k/elizabeth'
    #function that appends all the body parts of Emails
    def email_analyse(inputfile,  email_body):
        with open(inputfile, "r") as f:
            data = f.read()

        email = Parser().parsestr(data)

        email_body.append(email.get_payload())
    #end of function
    #defining a list that will contain bodies
    email_body = []
    #call the function email_analyse for every file in the directory
    for directory, subdirectory, filenames in  os.walk(rootdir):
        for filename in filenames:
            email_analyse(os.path.join(directory, filename),  email_body )
    #the stage where I clean the emails

    with open("email_body.txt", "w") as f:
        for val in email_body:
            if(val):
                val = val.replace("\n", "")
                val = val.replace("=01", "")
                #for some reason I had many ==20 and =01 sequences in my text
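                # (these are likely quoted-printable leftovers: "=20" encodes a
                #  space and a trailing "=" marks a soft line break)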
                val = val.replace("==20", "")
                f.write(val)
                f.write("\n")
Here is part of the output:
Well, with the photographer and the band I would say we are already over budget! Here is the information on the photographer. I feel like we could at least negotiate a few hours of some of the main packages for the rehearsal dinner. I don't know how much this usually costs, but he is not cheap. ---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM --------------------------- acollins@reggienet.com 09/13/99 05:37:37 Please respond to acollins@reggienet.com To: Elizabeth Lay/HOU/AZURIX@AZURIX cc: Subject: Denis Reggie Wedding Photography Hello Elizabeth: Congratulations on your upcoming wedding! I am Ashley Collins, Mr. Reggie's coordinator. Linda Kessler forwarded your e-mail address to me so that I could provide you with information on the scope of photography coverage for a Reggie wedding.

So the result is not plain text at all. Any ideas on how to handle this properly?
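For reference, the stop-word removal and sentence splitting mentioned at the top could look roughly like this once the bodies are clean (a minimal sketch; it assumes the email_body.txt written above and NLTK's 'punkt' and 'stopwords' data):

    import nltk
    from nltk.corpus import stopwords

    # one-time downloads of the sentence tokenizer models and the stop-word list
    nltk.download('punkt')
    nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))

    with open("email_body.txt") as f:
        text = f.read()

    # split the cleaned text into sentences, then drop stop words per sentence
    for sentence in nltk.sent_tokenize(text):
        words = [w for w in nltk.word_tokenize(sentence) if w.lower() not in stop_words]
        print(words)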

You might want to look into regular expressions to parse out the forwarded and replied text, since the formatting should be consistent across the whole corpus.

To remove forwarded text, you could use a regular expression like this:

-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)
It will match everything between four or more hyphens and a time in the XX:XX:XX PM format. Matching on three dashes would probably work too; we just want to avoid matching hyphens and em dashes that appear in the email body itself. You can play with this regex and write your own to match the To and Subject headers at the following link:
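Applied with re.sub, it could look like this (a minimal sketch: the re.DOTALL flag is an added assumption so the match can run across line breaks, and the sample body is made up):

    import re

    # the pattern suggested above; DOTALL lets ".*" run across line breaks
    forwarded = re.compile(r"-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)", re.DOTALL)

    body = ("Here is the info on the photographer. "
            "---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM "
            "--------------------------- acollins@reggienet.com 09/13/99 05:37:37 PM "
            "Please respond to acollins@reggienet.com ...")

    # drops everything from the first run of four or more hyphens through the
    # last HH:MM:SS AM/PM timestamp in the body
    print(forwarded.sub("", body))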

You can also look at section 3.4 of the NLTK book, which covers regular expressions in Python:


Good luck! This sounds like a fun project.

If you are still interested in this problem: I have created a preprocessing script specifically for the Enron dataset. You will notice that new emails always start with the tag "Subject:", so I implemented a step that removes all text to the left of that tag; by splitting only on the last "Subject:" tag it also drops all of the forwarded messages. The specific code:

# Cleaning content column
df['content'] = df['content'].str.rsplit('Subject: ').str[-1] 
df['content'] = df['content'].str.rsplit(' --------------------------- ').str[-1] 
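For intuition, here is what those two rsplit calls do to a single raw string (the body below is made up; each call keeps only the text after the last occurrence of its separator):

# made-up example body: a reply followed by a forwarded block with its own headers
raw = ("Thanks, see the forwarded note below. "
       "--------------------------- Forwarded by Jane Doe/HOU/ECT on 09/13/99 07:34 PM "
       "--------------------------- Subject: Photographer quote "
       "Hello, here is the quote you asked for.")

step1 = raw.rsplit('Subject: ')[-1]                        # keeps the text after the last "Subject: "
step2 = step1.rsplit(' --------------------------- ')[-1]  # keeps the text after the last separator, if any
print(step2)  # -> "Photographer quote Hello, here is the quote you asked for."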
The full script, in case it is of interest:

# Importing the dataset, and defining columns
import pandas as pd
df = pd.read_csv('enron_05_17_2015_with_labels_v2.csv', usecols=[2,3,4,13], dtype={13:str})

# Building a count of how many people are included in an email
df['Included_In_Email'] = df.To.str.count(',')
df['Included_In_Email'] = df['Included_In_Email'].apply(lambda x: x+1)

# Dropping any NaN's, and emails with >15 recipients
df = df.dropna()
df = df[~(df['Included_In_Email'] >=15)]

# Separating remaining emails into a line-per-line (one recipient per row) format
df['To'] = df.To.str.split(',')
df2 = (df.set_index(['From', 'Date', 'content', 'Included_In_Email'])['To']
         .apply(pd.Series)
         .stack())
df2 = df2.reset_index()

# Renaming the new column, dropping unneeded column, and changing indices
del df2['level_4']
df2 = df2.rename(columns = {0: 'To'})
df2 = df2[['Date','From','To','content','Included_In_Email']]
del df

# Cleaning email addresses
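# (The From/To fields of this CSV appear to be stored as the string form of a
#  frozenset, e.g. "frozenset({'john.doe@enron.com'})" (example value assumed),
#  which is why the literal text "frozenset" and the surrounding punctuation
#  are stripped off below.)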
df2['From'] = df2['From'].map(lambda x: x.lstrip("frozenset"))
df2['To'] = df2['To'].map(lambda x: x.lstrip("frozenset"))
df2['From'] = df2['From'].str.strip("<\\>(/){?}[:]*, ")
df2['To'] = df2['To'].str.strip("<\\>(/){?}[:]*, ")
df2['From'] = df2['From'].str.replace("'", "")
df2['To'] = df2['To'].str.replace("'", "")
df2['From'] = df2['From'].str.replace('"', "")
df2['To'] = df2['To'].str.replace('"', "")

# Accounting for users having multiple email addresses
email_dict = pd.read_csv('dict_email.csv')    
df2['From'] = df2.From.replace(email_dict.set_index('Old')['New'])
df2['To'] = df2.To.replace(email_dict.set_index('Old')['New'])
del email_dict

# Keeping only exchanges where both the From and To addresses contain @enron
df2['Enron'] = df2.From.str.count('@enron')
df2['Enron'] = df2['Enron']+df2.To.str.count('@enron')
df2 = df2[df2.Enron != 0]
df2 = df2[df2.Enron != 1]
del df2['Enron']

# Adding job roles which correspond to staff
import csv
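# dict_role.csv is assumed to hold one "address,role" pair per row, e.g.
# "jeff.skilling@enron.com,CEO"; filter(None, ...) skips any blank rows before
# the two-column rows are turned into an {address: role} dict.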
with open('dict_role.csv') as f:
    role_dict = dict(filter(None, csv.reader(f)))
df2['Sender_Role'] = df2['From'].map(role_dict)
df2['Receiver_Role'] = df2['To'].map(role_dict)
df2 = df2[['Date','From','To','Sender_Role','Receiver_Role','content','Included_In_Email']]
del role_dict

# Cleaning content column
df2['content'] = df2['content'].str.rsplit('Subject: ').str[-1] 
df2['content'] = df2['content'].str.rsplit(' --------------------------- ').str[-1] 

# Condensing records into one line per email exchange, adding weights
Weighted = df2.groupby(['From', 'To']).count()

# Adding weight column, removing redundant columns, splitting indexed column
Weighted['Weight'] = Weighted['Date']
Weighted = Weighted.drop(['Date', 'Sender_Role', 'Receiver_Role', 'content', 'Included_In_Email'], axis=1)
Weighted.reset_index(inplace=True)

# Re-adding job-roles to staff
with open('dict_role.csv') as f:
    role_dict = dict(filter(None, csv.reader(f)))
Weighted['Sender_Role'] = Weighted['From'].map(role_dict)
del role_dict

# Dropping exchanges with a weight of <= x, or no identifiable role
Weighted2 = Weighted[~(Weighted['Weight'] <=3)]
Weighted2 = Weighted2.dropna()
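
To see what the weighting step produces, here is a minimal sketch on a made-up three-row frame (the column names follow the script; the data is invented):

import pandas as pd

demo = pd.DataFrame({
    'From': ['a@enron.com', 'a@enron.com', 'b@enron.com'],
    'To':   ['b@enron.com', 'b@enron.com', 'a@enron.com'],
    'Date': ['2001-01-01', '2001-01-02', '2001-01-03'],
})

# groupby().count() leaves every remaining column holding the per-pair row count
w = demo.groupby(['From', 'To']).count()
w['Weight'] = w['Date']                     # number of emails sent From -> To
w = w.drop(['Date'], axis=1).reset_index()
print(w)  # a@enron.com -> b@enron.com has Weight 2; b@enron.com -> a@enron.com has Weight 1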