Php HTML转换为纯文本（用于电子邮件）_Php_Html_Text

Php HTML转换为纯文本（用于电子邮件）

php html text

Php HTML转换为纯文本（用于电子邮件）,php,html,text,Php,Html,Text,你知道什么好的HTML到纯文本转换类吗用PHP写的我需要它转换HTML邮件正文到纯文本邮件正文我写了一个简单的函数，但我需要更多的功能，比如转换表格，在末尾添加链接，转换嵌套列表 -- 问候 takeshin我建议使用HTML标记转换器这里的一个特定邮件发送实现只是与HTML一起生成，并将其输出用于文本版本。它不是完美的，但很有效。您也可以使用或。我知道问题是关于PHP的，但我使用lynx的思想制作了这个Perl子例程来将HTML转换为文本： use File::Temp;

你知道什么好的HTML到纯文本转换类吗用PHP写的

我需要它转换HTML邮件正文到纯文本邮件正文

我写了一个简单的函数，但我需要更多的功能，比如转换表格，在末尾添加链接，转换嵌套列表

-- 问候

takeshin

我建议使用HTML标记转换器

这里的一个特定邮件发送实现只是与HTML一起生成，并将其输出用于文本版本。它不是完美的，但很有效。您也可以使用或。

我知道问题是关于PHP的，但我使用lynx的思想制作了这个Perl子例程来将HTML转换为文本：

use File::Temp;

sub html2Txt {
    my $html = shift;
    my $htmlF = File::Temp->new(SUFFIX => '.html');
    print $htmlF $html;
    close $htmlF;
    return scalar `/usr/bin/lynx -dump $htmlF 2> /dev/null`;
}

print html2Txt '<b>Hi there</b> Testing';

使用File:：Temp；
子HTML2Text{
我的$html=shift；
my$htmlF=File:：Temp->new（后缀=>'.html'）；
打印$htmlF$html；
关闭$htmlF；
返回标量`/usr/bin/lynx-dump$htmlF 2>/dev/null`；
}
打印HTML2Text“Hi-there Testing”；

打印：

Hi-there-Testing

您可以使用带有-stdin和-dump选项的lynx来实现：

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}


试验
乱数假文
“如果你是我的朋友，那你就是我的朋友。”
“没有人喜欢痛苦本身，没有人追求痛苦并想要拥有痛苦，仅仅因为痛苦……”

Lorem ipsum dolor sit amet，是一位杰出的献身者。佩伦茨克和萨皮恩在港口和港口的运输过程中发生了意外事故。Nam rhoncus mauris ac dui tristique bibendum。胎膜异位大鼠妊娠。两个受害者都是孕妇自由的受害者。我是奥奇奥奇。潜力悬钩子。交通工具的相位是从红色到温和的红色。


阿利夸姆·费吉亚，内克·坦帕斯·朗库斯，内克·多洛·瓦库特·厄洛斯，非佩林式的精英阶层。佩伦茨克的自由之路，ultrices调味品lorem。莫利斯先生。这是一次意外事故。这是猫科动物的本能，也是猫科动物的本能。欧伊斯莫德的库拉比图尔是一个充满活力的地方。在埃利特的多洛尔·费吉亚·奎斯（dolor feugiat vehicula quis）的一辆车上，有一辆米。Mauris lacus Mauris，laoreet non-molestie nec，一个空塔。纳勒姆·鲁特鲁姆，自由女神佩伦特斯，埃拉特·尼布·奥纳雷·多洛，我是狮子座的阿库曼·伊斯特·里索斯。在厄洛斯调味品店的猫咪康瓦尔舞曲中，阿利奎姆·尼西·福西布斯。整型arcu-ligula，发酵液中的porttitor，lacinia nec dui。

EOT
);
fclose（$stdin）；
回波流获取内容（$pipes[1]）；
fclose（$pipes[1]）；
//在呼叫之前关闭所有管道非常重要
//程序关闭以避免死锁
$return\u value=proc\u close（$process）；
echo“命令返回$return\u值\n”；
}

仅当您有权在服务器上运行可执行文件时，才可以选择使用lynx。然而，这样做并不被认为是一种好的做法。此外，在安全主机中，php进程被限制为无法生成bash会话，这是运行lynx所必需的

我能找到的完全用PHP编写的最完整的解决方案是类。这是一个部分，从

我尝试过的其他解决方案包括：

如果有人得到了完美的解决方案，请将其发回以供进一步参考

在c#中：

私有字符串StripHTML（字符串源）
{
尝试
{
字符串结果；
//删除HTML开发格式
//用空格替换换行符
//因为浏览器插入空间
结果=源。替换（“\r”和“）；
//用空格替换换行符
//因为浏览器插入空间
结果=结果。替换（“\n”和“）；
//删除步骤格式
result=result.Replace（“\t”，string.Empty）；
//删除重复空格，因为浏览器会忽略它们
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"( )+", " ");
//删除标题（首先通过清除属性进行准备）
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"])*>", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"()", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
（）。（），string.Empty，
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
//删除所有脚本（首先通过清除属性进行准备）
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"])*>", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"()", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
//结果=System.Text.RegularExpressions.Regex.Replace（结果，
//         @"()([^(\.)])*()",
//字符串。空，
//系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@（）。（），string.Empty，
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
//删除所有样式（首先通过清除属性进行准备）
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"])*>", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
@"()", "",
系统.Text.RegularExpressions.RegexOptions.IgnoreCase）；
结果=System.Text.RegularExpressions.Regex.Replace（结果，
（）。（），string.Empty，
System.Text.RegularExpre
private string StripHTML(string source)
{
    try
    {
        string result;

        // Remove HTML Development formatting
        // Replace line breaks with space
        // because browsers inserts space
        result = source.Replace("\r", " ");
        // Replace line breaks with space
        // because browsers inserts space
        result = result.Replace("\n", " ");
        // Remove step-formatting
        result = result.Replace("\t", string.Empty);
        // Remove repeating spaces because browsers ignore them
        result = System.Text.RegularExpressions.Regex.Replace(result,
                                                              @"( )+", " ");

        // Remove the header (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*head([^>])*>", "<head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*head( )*>)", "</head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<head>).*(</head>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // remove all scripts (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*script([^>])*>", "<script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*script( )*>)", "</script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        //result = System.Text.RegularExpressions.Regex.Replace(result,
        //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
        //         string.Empty,
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<script>).*(</script>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // remove all styles (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*style([^>])*>", "<style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*style( )*>)", "</style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<style>).*(</style>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert tabs in spaces of <td> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*td([^>])*>", "\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert line breaks in places of <BR> and <LI> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*br( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*li( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert line paragraphs (double line breaks) in place
        // if <P>, <DIV> and <TR> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*div([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*tr([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*p([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<[^>]*>", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // replace special characters:
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @" ", " ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&bull;", " * ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lsaquo;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&rsaquo;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&trade;", "(tm)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&frasl;", "/",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lt;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&gt;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&copy;", "(c)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&reg;", "(r)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&(.{2,6});", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // for testing
        //System.Text.RegularExpressions.Regex.Replace(result,
        //       this.txtRegex.Text,string.Empty,
        //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\t)", "\t\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\r)", "\t\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\t)", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove multiple tabs following a line break with just one tab
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Initial replacement target string for line breaks
        string breaks = "\r\r\r";
        // Initial replacement target string for tabs
        string tabs = "\t\t\t\t\t";
        for (int index = 0; index < result.Length; index++)
        {
            result = result.Replace(breaks, "\r\r");
            result = result.Replace(tabs, "\t\t\t\t");
            breaks = breaks + "\r";
            tabs = tabs + "\t";
        }

        // That's it.
        return result;
    }
    catch
    {
        MessageBox.Show("Error");
        return source;
    }
}

//From https://stackoverflow.com/questions/1930297/html-to-plain-text-for-email/23988241#23988241
//converted from c# to PHP
class HtmlToText
{
    public static function stripHTML($source)
    {
        // Remove HTML Development formatting
        // Replace line breaks with space
        // because browsers inserts space
        $result = str_replace("\r",  " ",$source );
        // Replace line breaks with space
        // because browsers inserts space
        $result = str_replace("\n",  " ",$result );
        // Remove step-formatting
        $result = str_replace("\t",  "",$result );
        // Remove repeating spaces because browsers ignore them
        $result = preg_replace("/( )+/im",  " ", $result);


        // Remove html-Tag (prepare first by clearing attributes)
        $result = preg_replace("/<( )*html([^>])*>\s*/im",  "<html>", $result);

        $result = preg_replace("/(<( )*(\/)( )*html( )*>)/im",  "</html>", $result);

        $result = preg_replace("/(<html>)|(<\/html>)/im",  "", $result);

        // Remove the header (prepare first by clearing attributes)
        $result = preg_replace("/<( )*head([^>])*>/im",  "<head>", $result);

        $result = preg_replace("/(<( )*(\/)( )*head( )*>)/im",  "</head>", $result);

        $result = preg_replace("/(<head>).*(<\/head>)/im",  "", $result);


        // remove all scripts (prepare first by clearing attributes)
        $result = preg_replace("/<( )*script([^>])*>/im",  "<script>", $result);

        $result = preg_replace("/(<( )*(\/)( )*script( )*>)/im",  "</script>", $result);

        //$result = System.Text.RegularExpressions.Regex.Replace($result,
        //         "(<script>)([^(<script>\.</script>)])*(</script>)",
        //         "",
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        $result = preg_replace("/(<script>).*(<\/script>)/im",  "", $result);


        // remove all styles (prepare first by clearing attributes)
        $result = preg_replace("/<( )*style([^>])*>/im",  "<style>", $result);

        $result = preg_replace("/(<( )*(\/)( )*style( )*>)/im",  "</style>", $result);

        $result = preg_replace("/(<style>).*(<\/style>)/im",  "", $result);


        // insert tabs in spaces of <td> tags
        $result = preg_replace("/<( )*td([^>])*>/im",  "\t", $result);


        // insert line breaks in places of <BR> and <LI> tags
        $result = preg_replace("/<( )*br( )*\/?>/im",  "\r", $result);

        $result = preg_replace("/<( )*li( )*>/im",  "\r", $result);


        // insert line paragraphs (double line breaks) in place
        // if <P>, <DIV> and <TR> tags
        $result = preg_replace("/<( )*div([^>])*>/im",  "\r\r", $result);

        $result = preg_replace("/<( )*tr([^>])*>/im",  "\r\r", $result);

        $result = preg_replace("/<( )*p([^>])*>/im",  "\r\r", $result);


        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        $result = preg_replace("/<[^>]*>/im",  "", $result);


        // replace special characters:
        $result = preg_replace("/ /im",  " ", $result);


        $result = preg_replace("/&bull;/im",  " * ", $result);

        $result = preg_replace("/&lsaquo;/im",  "<", $result);

        $result = preg_replace("/&rsaquo;/im",  ">", $result);

        $result = preg_replace("/&trade;/im",  "(tm)", $result);

        $result = preg_replace("/&frasl;/im",  "/", $result);

        $result = preg_replace("/&lt;/im",  "<", $result);

        $result = preg_replace("/&gt;/im",  ">", $result);

        $result = preg_replace("/&copy;/im",  "(c)", $result);

        $result = preg_replace("/&reg;/im",  "(r)", $result);

        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        $result = preg_replace("/&(.{2,6});/im",  "", $result);


        // for testing
        //System.Text.RegularExpressions.Regex.Replace($result,
        //       this.txtRegex.Text,"",
            //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // make line breaking consistent
        $result = str_replace("\n",  "\r",$result );

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        $result = preg_replace("/(\r)( )+(\r)/im",  "\r\r", $result);

        $result = preg_replace("/(\t)( )+(\t)/im",  "\t\t", $result);

        $result = preg_replace("/(\t)( )+(\r)/im",  "\t\r", $result);

        $result = preg_replace("/(\r)( )+(\t)/im",  "\r\t", $result);

        // Remove redundant tabs
        $result = preg_replace("/(\r)(\t)+(\r)/im",  "\r\r", $result);

        // Remove multiple tabs following a line break with just one tab
        $result = preg_replace("/(\r)(\t)+/im",  "\r\t", $result);

        // Initial replacement target string for line breaks
        $breaks = "\r\r\r";
        // Initial replacement target string for tabs
        $tabs = "\t\t\t\t\t";
        for ($index = 0; $index < strlen($result); $index++)
        {
            $result = str_replace($breaks,  "\r\r",$result );
            $result = str_replace($tabs,  "\t\t\t\t",$result );
            $breaks = $breaks . "\r";
            $tabs = $tabs . "\t";
        }

        //remove spaces at the beginning of a line
        $result = preg_replace("/^ +/im",  "", $result);

        //line breaks at the beginning/end is probably unwanted. Coluld be left over by removing <html>/<head>/<body>
        $result = trim($result);

        // That's it.
        return $result;
    }
}