PHP: reading and counting words from a PDF document


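I have been working on a text-extraction project that handles several file extensions, but PDF and PowerPoint are giving me the most trouble. Below is my code for PDF. Does anyone know how to read text from an existing PDF document using any tool or library (tcpdf, xpdf, or fpdfi)? I have not yet seen an accurate solution for reading text from PDF or PPT. Please do not suggest Zend solutions.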

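function pdf2txt($filename){

    // initialise the counter and accumulators used below
    $j = 0;
    $a_chunks = array();
    $result_data = "";

    $data = getFileData($filename);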

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){

        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

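            $a_data = getDataArray($obj,"stream","endstream");
            if (is_array($a_data)){
                // drop the keywords, then the single EOL (CRLF or LF alone) that follows "stream"
                $raw = substr($a_data[0],strlen("stream"),strlen($a_data[0])-strlen("stream")-strlen("endstream"));
                $a_chunks[$j]["data"] = preg_replace('/^\r?\n/','',$raw);
            }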
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        // split() was removed in PHP 7; explode() does the same job for a literal delimiter
        $a_filter = explode("/",$chunk["filter"]);

        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used          
            // strpos() searches for the filter name; substr() would not
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {

                    //$result_data .= "x";
                }
            }
        }
    }

    return $result_data;

}


// Function    : ps2txt()
// Arguments   : $ps_data - postscript data you want to convert to plain text
// Description : Does a very basic parse of postscript data to
//             :  return the plain text
// Author      : Jonathan Beckett, 2005-05-02
function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}


// Function    : getFileData()
// Arguments   : $filename - filename you want to load
// Description : Reads data from a file into a variable
//               and passes that data back
// Author      : Jonathan Beckett, 2005-05-02
function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}


// Function    : getDataArray()
// Arguments   : $data       - data you want to chop up
//               $start_word - delimiting characters at start of each chunk
//               $end_word   - delimiting characters at end of each chunk
// Description : Loop through an array of data and put all chunks
//               between start_word and end_word in an array
// Author      : Jonathan Beckett, 2005-05-02
function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = null; // stays null when nothing matches, so the callers' is_array() checks fail

    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
This one is for PowerPoint. I found it somewhere on this site, but it is not working either:
function parsePPT($filename) {
    // This approach looks for the byte pattern chr(0x0f).Hex_value.chr(0x00).chr(0x00).chr(0x00)
    // to find text strings, which are then terminated by another NUL, chr(0x00).
    // [1] Get text between delimiters [2]
    $fileHandle = fopen($filename, "rb"); // "rb" keeps the read binary-safe
    $line = @fread($fileHandle, filesize($filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0f),$line);
    $outtext = '';

    foreach($lines as $thisline) {
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
            $text_line = substr($thisline, 4);
            $end_pos   = strpos($text_line, chr(0x00));
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$text_line);
            // skip placeholder text and strings that are too short to be useful
            if (substr($text_line,0,20)!="Click to edit Master" && strlen($text_line) > 1) {
                $outtext .= $text_line."\n<br>";
            }
        }
    }
return $outtext;
}


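Why are you reinventing the wheel? You can use, for example, xpdf or a similar tool to extract the text data from the PDF, and then process the plain-text file that this produces. The same approach works for almost any file format that contains text (i.e., convert it to a plain-text version first, then process that).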
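
The convert-then-process approach could look roughly like the sketch below. It assumes the `pdftotext` command-line tool (shipped with xpdf and poppler-utils) is installed and on the PATH, and that shell execution is enabled in PHP. The function names `pdfToTextViaCli` and `countWords` are made up for this illustration and are not part of any library.

```php
<?php
// Hypothetical helper: extracts text by shelling out to `pdftotext`.
// Assumes the tool is installed; "-" as the output file sends the text to stdout.
function pdfToTextViaCli(string $pdfPath): string
{
    $cmd = 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
    $output = shell_exec($cmd);
    if (!is_string($output)) {
        // shell_exec() returns null/false on failure or when there is no output
        throw new RuntimeException("pdftotext produced no output for $pdfPath");
    }
    return $output;
}

// Hypothetical helper: counts words by splitting on runs of whitespace.
function countWords(string $text): int
{
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false if the input is not valid UTF-8
    return $words === false ? 0 : count($words);
}
```

With these in place, `countWords(pdfToTextViaCli('document.pdf'))` would give the word count for a hypothetical `document.pdf`. Keeping extraction and counting separate means the counting step can be reused for text obtained from other formats.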

If you go with this solution, this might be an interesting read.