Extract all email headers, including the body part, from a mail in PHP


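I want to use a regular expression in PHP to extract the body part of each message from the mail chain below. The chain is saved as a .txt file. If a body contains HTML tags, they should be left unchanged when extracting.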

 $content = <<<HEREDOC

    From: Matrimony <matrimony@mangalsutrabandhan.in>
    Sent: Fri, 12 Aug 2011 16:17:40
    To: "matrimony@mangalsutrabandhan.com" <matrimony@mangalsutrabandhan.in>
    Subject: Re: bride search


    From: brides <sales@mangalsutrabandhan.com>
    Sent: Fri, 12 Aug 2011 15:49:52
    To: "Matrimony " <matrimony@mangalsutrabandhan.in>
    Cc: "groom" <brides@mangalsutrabandhan.com>
    Subject: Re: bride search
    PFA

    Regds.,
    sales


    From: shaadi <kundaali@mangalsutrabandhan.in>
    Sent: Tue, 22 Feb 2011 16:40:24
    To: <vivaah@mangalsutrabandhan.com>, <bandhan@mangalsutrabandhan.com>
    Cc: "'lagna '" <lagna@mangalsutrabandhan.in>, <movies@mangalsutrabandhan.in>, <manishv@mangalsutrabandhan.com>, "'beta data'" <channel@mangalsutrabandhan.com>, "'test S'" <city@mangalsutrabandhan.com>
    Subject: Re:data transfer would be made live for 145 test

    This is to inform you that we are going to test today.



    Activity Timing: 9:00 PM onwards



    Thanks and Regards,

    free matrimony

    shaadi Operations


     P  Please do not print this e-mail unless it is absolutely necessary

    From: shaadi [nikaah:kundaali@mangalsutrabandhan.in]
    Sent: 21 February 2011 23:09
    To: vivaah@mangalsutrabandhan.com; bandhan@mangalsutrabandhan.com
    Cc: 'lagna '; movies@mangalsutrabandhan.in; manishv@mangalsutrabandhan.com; 
    Subject: data transfer would be made live for 145 test



    Hi,

    gtsdhsdbh
    anbdsmbsa
    sda the data test .

    Would request you to send in your feedback.



    Thanks and Regards,



    beta data

    assa xyz


     P  Please do not print this e-mail unless it is absolutely necessary



    HEREDOC;
This is the regular expression I used on the content above:

preg_match_all('/(?<=Subject: )(.*?[\n][\s]*?)(?=From:)/is',$content,$rest);
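But it does not return the last message, because there is no following "From:" to mark where its data ends. I hope that is clear. Please let me know if there is any other way to do this as well.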
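One possible way around the missing final match is to let the lookahead also accept the end of the input. The following is only a sketch on a hypothetical two-message sample (the `example.com` addresses and the `$sample` variable are made up for illustration). As in the original attempt, each capture starts with the subject text itself.

```php
<?php
// Hypothetical two-message sample, not the original mail chain.
$sample = "From: a <a@example.com>\nSubject: first\nbody one\n\n"
        . "From: b <b@example.com>\nSubject: second\nbody two\n";

// Capture everything after "Subject: " up to the next line that starts
// with "From:", or up to the very end of the input (\z) for the last message.
preg_match_all('/^Subject: (.*?)(?=^From:|\z)/ms', $sample, $rest);

$bodies = array_map('trim', $rest[1]);
print_r($bodies);
```

The `m` flag makes `^` match at the start of every line, and the `s` flag lets `.` span newlines, so the lazy `.*?` stops at the first following header line or at the end of the text.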

preg_match_all('/(?m:^From:\x20(?<From>[^\n]*)\n^Sent:\x20(?<Sent>[^\n]*)\n^To:\x20(?<To>[^\n]*)\n(?:^Cc:\x20(?<Cc>[^\n]*)\n)?^Subject:\x20(?<Subject>[^\n]*)\n)(?<Body>.*?(?=(?:\nFrom:)|$))/s',$content,$matches);
echo "<pre>".print_r($matches,true);

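It gives almost the correct output. Should I upload the text file?

You need smarter parsing to make sense of this. Whatever generated this file has altered the structure of the emails: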

There should be at least one blank line between a message's headers and its body.
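For a message that does follow this convention, separating headers from body is straightforward. A minimal sketch, assuming a single well-formed message; the helper name `splitHeadersAndBody` and the sample address are hypothetical:

```php
<?php
// Hypothetical helper: split one message into headers and body
// at the first blank line.
function splitHeadersAndBody(string $message): array
{
    // Normalise line endings, then split once on the first empty line.
    $message = str_replace(["\r\n", "\r"], "\n", $message);
    $parts = preg_split('/\n[ \t]*\n/', $message, 2);

    $headers = [];
    foreach (explode("\n", trim($parts[0])) as $line) {
        // Keep "Name: value" pairs; lines without that shape are skipped.
        if (preg_match('/^([A-Za-z-]+):\s*(.*)$/', trim($line), $m)) {
            $headers[$m[1]] = $m[2];
        }
    }

    return ['headers' => $headers, 'body' => $parts[1] ?? ''];
}

$message = "From: brides <sales@example.com>\nSubject: Re: bride search\n\nPFA\n\nRegds.,\nsales\n";
$result = splitHeadersAndBody($message);
print_r($result);
```

This only works when the blank line is really there. In the chain from the question, at least one message has its body directly after the `Subject:` line, which is why a plain header/body split is not enough for that file.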

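You then run into further problems, such as timestamps in the headers that cannot be relied on without knowing the time zone, and incomplete headers.

So even if you build a heuristic to parse this, it will not cope with many scenarios.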

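I don't know whether a regular expression is the best choice here. You may be better off splitting the document on clusters of To/From/Subject data. On that basis, anything in between should be treated as content.

Would you edit your question to clarify the desired output?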
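The cluster-based idea can be sketched roughly as follows. This is an illustration under assumptions, not a robust parser: it assumes every message starts with consecutive `From:`, `Sent:`, `To:`, optional `Cc:` and `Subject:` lines, and the helper name `splitMailChain` and the sample addresses are hypothetical.

```php
<?php
// Hypothetical helper: treat each From/Sent/To/[Cc]/Subject run as a header
// cluster and everything up to the next cluster as that message's body.
function splitMailChain(string $text): array
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    $cluster = '/^[ \t]*From:[^\n]*\n[ \t]*Sent:[^\n]*\n[ \t]*To:[^\n]*\n'
             . '(?:[ \t]*Cc:[^\n]*\n)?[ \t]*Subject:[^\n]*(?:\n|\z)/m';

    // PREG_OFFSET_CAPTURE returns [matched text, byte offset] pairs.
    preg_match_all($cluster, $text, $found, PREG_OFFSET_CAPTURE);

    $messages = [];
    foreach ($found[0] as $i => [$headerBlock, $start]) {
        $bodyStart = $start + strlen($headerBlock);
        // The body runs to the next cluster, or to the end for the last one.
        $bodyEnd = $found[0][$i + 1][1] ?? strlen($text);
        $messages[] = [
            'headers' => trim($headerBlock),
            'body'    => trim(substr($text, $bodyStart, $bodyEnd - $bodyStart)),
        ];
    }
    return $messages;
}

$chain = "From: a <a@example.com>\nSent: Fri, 12 Aug 2011 16:17:40\nTo: b <b@example.com>\nSubject: Re: test\n\n"
       . "From: b <b@example.com>\nSent: Fri, 12 Aug 2011 15:49:52\nTo: a <a@example.com>\n"
       . "Cc: c <c@example.com>\nSubject: Re: test\nPFA\n\nRegds.,\nsales\n";
$messages = splitMailChain($chain);
print_r($messages);
```

Because the body is defined by where the header clusters sit rather than by a blank line, this handles both an empty reply and a body that follows `Subject:` directly. It would still be fooled by body text that happens to contain the same run of header lines.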
Subject: Re: bride search
PFA