PHP: Converting RTF to plain text

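I have an ERP system that stores text in RTF format, and I am trying to extract the plain text from it.

I googled a bit and found one solution, plus a few others that use regex replacement, but none of them seem to work. I always get NULL, } or something completely wrong.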

Here is the regex I tried:

$matches=array('/\{\\\\(.+?)\}/','/\\\\\(.+?)\b/');
$row['text']=preg_replace($matches,'',$row['text']);

But it returns:

}}

Here is my RTF data:

{\rtf1\deff0{\fonttbl{\f0 Calibri;}{\f1 Arial;}{\colortbl;\red0\green0\blue255;}{\*\defchp\fs22}{\*\listoverridetable}{\stylesheet{\ql\fs22 Normal;}{\*\cs1\f1\fs20默认段落字体;}{\*\cs2\sbasedon1\f1\fs20行号;}{\*\cs3\ul fs22\cf1超链接;}{\*\ts4\tsrowd\fs22\ql\trautofit1\tscellpaddfl3\tscellpaddl108\tscellpaddfr3\tscellpaddr108\tsvertalt\cltxlrtb Normal Table;}{\*\ts5\tsrowd\sbasedon4\fs22\ql\trbrdrt\brdrs\brdrw10\trdrdrs\brdrw10\trbrdrb\brdrs\brdrw10\trbrdrr\brdrs\brdrw10\trbrdrh\brdrs\brdrw10\trbrdrv\brdrs\brdrw10\trautofits1\tscellpaddfl3\tscellpaddl108\tscellpaddfr108\tscellpattalt\clrtb表1;\sectd\stcrdr\brdr\brdrw10\brdr{\f1\fs20\cf0迁移文件服务器innerhalb derselben order einer vertrauten Dom\u228'e4ne}\f1\fs20\pard\plain\ql{\f1\fs20\cf0 Anpassung der Laufwerksfreigaben}\f1\fs20\pard\plain\ql{\f1\fs20\cf0 freigabeners tellung wie bestland(weitere Absprachen hierzu m\f6glich)}\f1\fs20\par(e) :}\f1\fs20\par\pard\plain\ql{\f1\fs20\cf0 Hostname Zielsystem:}\f1\fs20\pard\plain\ql{\f1\fs20\cf0 bekannets Datenvolumen:}\f1\fs20\pard\plain\ql{\f1\fs20\cf0 Clientseitige Nacharbeiten aufgrund fest vergebener Einstellungen}\f1\fs20\pard\plain\ql{f1\fs20\pard\cf0 erfolgen nach aufwrager auften}\f1\fs20\par\pard\plain\ql{\f1\fs20\cf0我在hingewiesen的一份报告中指出,日期与时间之比为1:1。日期与时间之比为1:1{\f1\fs20\cf0 VORAUSETZUNGEN zur Zusatzaufwandsfreien Durchf\u252\fchrung:}\f1\fs20\pard\plain\ql{\f1\fs20\cf0千兆交换zwischen allen Quell-und ZIELSYSTEM,Vollzugriff auf den migrierenden Datenbestand}\f1\fs20\pard\plain\ql\f1\fs20\par}

EDIT 2019: For everyone who finds this question: I have been using this single-class project for 4 years without any problems.

After some brainwork, I found a solution for you.

Try this regex:

"{\*?\\.+(;})|\s?\\[A-Za-z0-9]+|\s?{\s?\\[A-Za-z0-9]+\s?|\s?}\s?"
That means replacing your code with:

$count = null;    
$matches = array('"{\*?\\.+(;})|\s?\\[A-Za-z0-9]+|\s?{\s?\\[A-Za-z0-9]+\s?|\s?}\s?"');
$row['text'] = preg_replace($matches,'',$row['text'], -1, $count);
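
A side note on escaping, since the pattern above sits inside a single-quoted PHP string: there, `\\` collapses to a single backslash before the regex engine ever sees it. So `\\.` reaches PCRE as `\.` (a literal dot), and four backslashes are needed to match one literal backslash. The following is only a small sketch to check this; the variable names are made up for illustration.

```php
<?php
// In a single-quoted PHP string, "\\" is the escape for ONE backslash.
$twoBackslashes  = '\\.';    // PHP string: \.   -> regex: a literal dot
$fourBackslashes = '\\\\.';  // PHP string: \\.  -> regex: a literal backslash + any character

$sample = '\par';            // four characters: backslash, p, a, r

// No dot in the sample, so the first pattern does not match.
$matchesDot = preg_match('/' . $twoBackslashes . '/', $sample);

// The second pattern finds the backslash followed by "p".
$matchesBackslash = preg_match('/' . $fourBackslashes . '/', $sample);

var_dump(strlen($twoBackslashes));  // int(2)
var_dump($matchesDot);              // int(0)
var_dump($matchesBackslash);        // int(1)
```

If a pattern copied from a web page behaves differently than described, counting the backslashes that actually reach the regex engine is usually the first thing worth checking.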

You can use a ready-made RTF text extractor, the RtfTexter class.

Here is an example of how to use it:

include ( 'path/to/RtfTexter.phpclass' ) ;

$doc = new RtfTexter ( 'sample.rtf' ) ;
echo $doc -> AsString ( ) ;               // Echo text contents to stdout
$doc -> SaveTo ( 'sample.txt' ) ;         // Save text contents to file 'sample.txt'

I am posting a general method here for others, with the fix for this issue:

public static function converToPlain($text){
    $text = preg_replace('"{\*?\\\\.+(;})|\\s?\\\[A-Za-z0-9]+|\\s?{\\s?\\\[A-Za-z0-9]+\\s?|\\s?}\\s?"', '', $text);
    return $text;
}
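
To see what this pattern does, here is a small usage sketch. The function name `rtfToPlainSketch` and the sample RTF string are invented for illustration; only the pattern is taken from the method above. Note that the first alternative is greedy and swallows everything from the first `{\` up to the last `;}` on the line, so any body text that appears before a later `;}` would be lost as well.

```php
<?php
// Hypothetical stand-alone wrapper around the pattern from the answer above.
function rtfToPlainSketch(string $rtf): string
{
    $pattern = '"{\*?\\\\.+(;})|\\s?\\\[A-Za-z0-9]+|\\s?{\\s?\\\[A-Za-z0-9]+\\s?|\\s?}\\s?"';
    $plain = preg_replace($pattern, '', $rtf);

    // preg_replace() returns null on failure; fall back to an empty string here.
    return trim($plain ?? '');
}

// Made-up, single-line RTF sample: a font table followed by one paragraph.
$sample = '{\rtf1\ansi{\fonttbl{\f0 Calibri;}}\f0 Hello World\par}';

echo rtfToPlainSketch($sample), PHP_EOL; // Hello World
```

The `trim()` call is there because the control word in front of the text is removed but the space that follows it is not.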

Hi everyone, I wrote this code to read the plain text from an RTF file. This code works 100%.

PHP code:

$text = file_get_contents('testfile.rtf');
if (!strlen($text)) {
 echo "bad file";
 exit();

}
// we'll try to fix up the parts of the rtf as best we can
// clean up the file a little to simplify parsing
$text=str_replace("\r",' ',$text); // returns
$text=str_replace("\n",' ',$text); // new lines
$text=str_replace('  ',' ',$text); // double spaces
$text=str_replace('  ',' ',$text); // double spaces
$text=str_replace('  ',' ',$text); // double spaces
$text=str_replace('  ',' ',$text); // double spaces
$text=str_replace('} {','}{',$text); // embedded spaces
// skip over the heading stuff
$j=strpos($text,'{',1); // skip ahead to the first part of the header

$loc=1;
$t="";

$ansa="";
$len=strlen($text);
getpgraph(); // skip past the first paragraph

while($j<$len) {
    $c=substr($text,$j,1);
    if ($c=="\\") {
        // have a tag
        $tag=gettag();
        if (strlen($tag)>0) {
            // process known tags
            switch ($tag) {
                case 'par':
                    $ansa.="\r\n";
                    break;
                // add a list of common tags
                // parameter tags
                case 'spriority1':
                case 'fprq2':
                case 'author':
                case 'operator':
                case 'sqformat':
                case 'company':
                case 'xmlns1':
                case 'wgrffmtfilter':
                case 'pnhang':
                case 'themedata':
                case 'colorschememapping':
                    // consume the parameter that follows the tag
                    $tt=gettag();
                    break;
                case '*':
                case 'info':
                case 'stylesheet':
                    // skip to the end of the enclosing group
                    $j--;
                    getpgraph();
                    break;
                default:
                    // ignore the tag
            }
        }
    } else {
        $ansa.=$c;
    }
    $j++;
}
$ansa=str_replace('{','',$ansa);
$ansa=str_replace('}','',$ansa);
echo "<pre>$ansa</pre>";

function getpgraph() {
    // if the first char after a tag is { then throw out the entire paragraph
    // this has to be nested
    global $text;
    global $j;
    global $len;
    $nest=0;
    while(true) {
        $j++;
        if ($j>=$len) break;
        if (substr($text,$j,1)=='}') {
            if ($nest==0) return;
            $nest--;
        }
        if (substr($text,$j,1)=='{') {
            $nest++;
        }
    }
    return;
}

function gettag() {
    // gets the text following the \ character, or gets the parameter if there is one
    global $text;
    global $j;
    global $len;
    $tag='';
    while(true) {
        $j++;
        if ($j>=$len) break;
        $c=substr($text,$j,1);
        if ($c==' ') break;
        if ($c==';') break;
        if ($c=='}') break;
        if ($c=="\\") {
            $j--;
            break;
        }
        if ($c=="{") {
            //getpgraph();
            break;
        }
        if ((($c>='0')&&($c<='9'))||(($c>='a')&&($c<='z'))||(($c>='A')&&($c<='Z'))||$c=="'"||$c=="-"||$c=="*" ){
            $tag=$tag.$c;
        } else {
            // end of tag
            $j--;
            break;
        }
    }
    return $tag;
}
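
The core idea in `getpgraph()` above is brace counting: a group such as `{\stylesheet ...}` may contain nested groups, so it can only be skipped by tracking the nesting depth. Here is a minimal sketch of the same idea without globals. The function name `skipGroup` is hypothetical, and treating a backslash as "skip the next character" (so that `\{` and `\}` are not counted) is my own simplification, not part of the code above.

```php
<?php
// Given the offset of an opening "{", return the offset just past its matching "}".
// Returns strlen($rtf) if the group is never closed.
function skipGroup(string $rtf, int $start): int
{
    $depth = 0;
    $len = strlen($rtf);

    for ($i = $start; $i < $len; $i++) {
        $c = $rtf[$i];
        if ($c === '\\') {
            $i++;               // skip the escaped character, e.g. \{ or \}
            continue;
        }
        if ($c === '{') {
            $depth++;
        } elseif ($c === '}') {
            $depth--;
            if ($depth === 0) {
                return $i + 1;  // position right after the closing brace
            }
        }
    }

    return $len;
}

$rtf = '{a{b}c}d';
echo substr($rtf, skipGroup($rtf, 0)), PHP_EOL; // d
```

Returning the new offset instead of mutating a global `$j` makes the helper easier to test in isolation.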

I tried @Anurag Prashant's suggestion, but it does not always work; for example, one of my RTF documents was not converted correctly.

Here is a PHP regex that seems to work better:

/(\{.*\})|}|(\\\\(?!')\S+)/m

public static function converToPlain($text)
{
    $text = preg_replace("/(\{.*\})|}|(\\\\(?!')\S+)/m", "", $text);
    return $text;
}

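I know this is an old question, but I recently had to do something similar, and the various answers on this page helped me a lot. However, none of them did exactly what I needed, so I had to combine several of them and then add a bit of my own.

The following code:

  • converts RTF to TXT (not HTML)
  • handles accented or special characters (é, ç, etc.)
  • also extracts PNG images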
    $plain = $rtfSource;
    
    // we have to remove all line breaks, otherwise
    // the RTF>TXT regexp below doesn't work correctly.
    $plain = preg_replace( '/\r|\n/', '', $plain);
    
    // extract the images
    // example: {\pict\pngblip\picw1166\pich190\picwgoal8071\pichgoal1315 89504e470d0a1a0a00...454e44ae426082}
    // the hexadecimal code for the image starts after the
    // whitespace and runs until the first } that we encounter.
    // then it has to be converted into base64.
    $imgHtml = '';
    $imgMatches = array();
    $imgRegex = '/{\\\\pict\\\\pngblip\\\\[a-z0-9]+\\\\[a-z0-9]+\\\\[a-z0-9]+\\\\[a-z0-9]+ ([a-z0-9]+)}/';
    preg_match_all($imgRegex, $plain, $imgMatches);
    if (count($imgMatches[1])) {
        for ($i=0; $i < count($imgMatches[1]); $i++) {
            $imgHtml .= '<img src="data:image/png;base64, ' . base64_encode(pack('H*', $imgMatches[1][$i])) . '">';
        }
    }
    
    // remove those images (or else their hex code is still displayed as text)
    $plain = preg_replace($imgRegex, '', $plain);
    
    // RTF>TXT (https://stackoverflow.com/a/42525858/357546)
    $plain = preg_replace('"{\*?\\\\.+(;})|\\s?\\\[A-Za-z0-9]+|\\s?{\\s?\\\[A-Za-z0-9]+\\s?|\\s?}\\s?"', '', $plain);
    
    // special characters; for a full list, see:
    // https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch04.html
    $plain = str_replace("\'3f", '?', $plain);
    $plain = str_replace("\'80", '€', $plain);
    $plain = str_replace("\'a8", '¨', $plain);
    $plain = str_replace("\'ab", '«', $plain);
    $plain = str_replace("\'ae", '®', $plain);
    $plain = str_replace("\'b0", '°', $plain);
    $plain = str_replace("\'bb", '»', $plain);
    $plain = str_replace("\'c4", 'Ä', $plain);
    $plain = str_replace("\'c9", 'É', $plain);
    $plain = str_replace("\'d6", 'Ö', $plain);
    $plain = str_replace("\'dc", 'Ü', $plain);
    $plain = str_replace("\'df", 'ß', $plain);
    $plain = str_replace("\'e0", 'à', $plain);
    $plain = str_replace("\'e2", 'â', $plain);
    $plain = str_replace("\'e4", 'ä', $plain);
    $plain = str_replace("\'e7", 'ç', $plain);
    $plain = str_replace("\'e8", 'è', $plain);
    $plain = str_replace("\'e9", 'é', $plain);
    $plain = str_replace("\'ea", 'ê', $plain);
    $plain = str_replace("\'eb", 'ë', $plain);
    $plain = str_replace("\'ee", 'î', $plain);
    $plain = str_replace("\'f4", 'ô', $plain);
    $plain = str_replace("\'f6", 'ö', $plain);
    $plain = str_replace("\'f8", 'ø', $plain);
    $plain = str_replace("\'fb", 'û', $plain);
    $plain = str_replace("\'fc", 'ü', $plain);
    
    // a bit of cleaning
    $plain = trim($plain);
    $plain = preg_replace('/^-0 /', '', $plain);
    $plain .= $imgHtml;
    
    echo $plain;
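
The list of `str_replace()` calls above only covers the characters that were typed in by hand. As a possible generalization, the `\'hh` escapes can be decoded in one pass. This is only a sketch under two assumptions that are not stated in the answer: the RTF uses the Windows-1252 code page (typically declared as `\ansicpg1252`), and the mbstring extension is available. The function name `decodeRtfHexEscapes` is made up.

```php
<?php
// Replace every \'hh escape with the corresponding character, converted to UTF-8.
// ASSUMPTION: the document's code page is Windows-1252 unless told otherwise.
function decodeRtfHexEscapes(string $text, string $codePage = 'Windows-1252'): string
{
    $decoded = preg_replace_callback(
        "/\\\\'([0-9a-fA-F]{2})/",   // regex: backslash, apostrophe, two hex digits
        static function (array $m) use ($codePage): string {
            $byte = chr(hexdec($m[1]));
            return mb_convert_encoding($byte, 'UTF-8', $codePage);
        },
        $text
    );

    // preg_replace_callback() returns null on failure; keep the input in that case.
    return $decoded ?? $text;
}

echo decodeRtfHexEscapes("Gr\\'fc\\'dfe"), PHP_EOL; // Grüße
```

This would have to run after the control words are stripped, in the same place as the `str_replace()` block, because the RTF>TXT regex leaves the `\'hh` sequences untouched.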