Reading/writing MS Word files with PHP


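Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object? I know that I can:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);

But Word will read it as an HTML file, not as a native .doc file.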

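I don't know how to read native Word documents in PHP, but if you want to write a Word document in PHP, WordML might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.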
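As a rough illustration of the WordML idea, the sketch below builds a minimal Word 2003 XML document with one paragraph per input string. The helper name `buildWordMl()` and the output file name are assumptions made for this example, and only plain paragraphs are covered (no styles, tables, or images).

```php
<?php
// Sketch: build a minimal WordprocessingML (Word 2003 XML) document.
// buildWordMl() is a hypothetical helper, not part of any library.

function buildWordMl(array $paragraphs): string
{
    $body = '';
    foreach ($paragraphs as $text) {
        // Escape &, <, > and quotes so the text is valid XML character data.
        $safe = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $body .= '<w:p><w:r><w:t>' . $safe . '</w:t></w:r></w:p>';
    }

    // The mso-application processing instruction asks Windows to open
    // the file with Word instead of a generic XML viewer.
    return '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n"
        . '<?mso-application progid="Word.Document"?>' . "\n"
        . '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
        . '<w:body>' . $body . '</w:body>'
        . '</w:wordDocument>';
}

$xml = buildWordMl(['Hello from PHP', 'Fish & chips']);
file_put_contents(sys_get_temp_dir() . '/hello.xml', $xml);
```

Saving the result with an .xml extension keeps the file honest about its format; Word 2003 and 2007 should both be able to open it.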

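You most likely won't be able to read Word documents without COM.

Writing has already been covered above. Office 2007 .docx should be possible, since it is an XML-based standard. Word 2003 most likely requires COM for reading, even with the format specifications that MS has now published, because those specifications are huge. I haven't seen many libraries that implement them yet.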

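2007 might be a bit complicated as well.

The .docx format is a zip file that contains several folders, which in turn hold other files for formatting and other content.

Rename a .docx file to .zip and you'll see what I mean.

So if you can work with zip files in PHP, you should be on the right path.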
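To make the "it is just a zip file" point concrete, here is a small sketch that opens an archive with PHP's `ZipArchive` class and lists the parts inside. It assumes the zip extension is enabled. The helpers `makeSampleDocx()` and `listDocxParts()` are hypothetical names, and the sample archive is only a stand-in with a docx-like layout, not a document Word could open.

```php
<?php
// Sketch: a .docx is a ZIP archive, so ZipArchive (PHP zip extension) can open it.
// makeSampleDocx() and listDocxParts() are hypothetical helpers for illustration.

function makeSampleDocx(string $path): void
{
    // Build a tiny stand-in archive that mimics the layout of a real .docx.
    // It is NOT a complete, Word-openable document.
    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException("Cannot create $path");
    }
    $zip->addFromString('[Content_Types].xml', '<Types/>');
    $zip->addFromString('word/document.xml', '<w:document/>');
    $zip->close();
}

function listDocxParts(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path");
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}

$sample = tempnam(sys_get_temp_dir(), 'docx');
makeSampleDocx($sample);
print_r(listDocxParts($sample));
unlink($sample);
```

With a real .docx, the list would typically include `word/document.xml`, which holds the main body text.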

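I don't know what you are going to use it for, but I needed .doc support for search indexing. What I did was use a little command-line tool called "catdoc", which converts the contents of a Word document to plain text so that it can be indexed. If you need to keep formatting and the like, this is not your tool.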
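Calling catdoc from PHP could look roughly like the sketch below. It assumes a `catdoc` binary is installed and on the PATH and that `shell_exec()` is not disabled; `catdocCommand()` and `docToText()` are hypothetical helper names, and `report.doc` is a placeholder file name.

```php
<?php
// Sketch: shell out to catdoc to get plain text for indexing.
// Assumes catdoc is installed; the helper names are made up for this example.

function catdocCommand(string $path): string
{
    // escapeshellarg() keeps spaces or quotes in the file name from breaking
    // (or injecting into) the shell command.
    return 'catdoc ' . escapeshellarg($path);
}

function docToText(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $output = shell_exec(catdocCommand($path));
    // shell_exec() yields null or false when nothing was run or printed.
    return is_string($output) ? $output : null;
}

$text = docToText('report.doc');
echo $text === null ? "No text extracted\n" : $text;
```

Returning `null` instead of an empty string lets the indexing code tell "could not convert" apart from "document has no text".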

This works with binary .doc files, in pure PHP:

<?php

/*****************************************************************
This approach uses detection of NUL (chr(0x00)) and end of line (chr(0x0D))
to decide where the text is:
- divide the file contents up by chr(0x0D)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc)
{
    // Read the whole file as binary data; give up if it cannot be read.
    $contents = @file_get_contents($userDoc);
    if ($contents === false) {
        return "";
    }

    $lines = explode(chr(0x0D), $contents);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty slices and slices that contain a NUL byte (binary data).
        if ($thisline === "" || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Drop everything except basic ASCII letters, digits, and punctuation.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;

?>

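Reading binary Word documents would involve writing a parser according to the published file format specifications for the DOC format. I don't think that is a really feasible solution.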

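You can use the Microsoft Office XML formats to read and write Word files; this is compatible with both the 2003 and 2007 versions of Word. For reading, you have to make sure that the Word documents are saved in the correct format (called "Word 2003 XML Document" in Word 2007). For writing, you just have to follow the publicly available XML schema. I have never used this format to write Office documents from PHP, but I have used it to read an Excel worksheet (naturally saved as "XML Spreadsheet 2003") and display its data on a web page. Since the files are plain XML data, it is no problem to navigate within them and figure out how to extract the data you need.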

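The other option, which is Word 2007 only (if the OpenXML file formats are not installed in your Word 2003), is to resort to OpenXML. As already pointed out, the DOCX file format is just a ZIP archive with XML files inside. There are plenty of resources on the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be more complicated, I think; it depends on how much time you are willing to invest.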
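Once the main document part has been pulled out of the archive, its text can be read with an XML parser. The sketch below uses `DOMXPath` on a hand-written sample string. `docxXmlToText()` is a hypothetical helper, and in real use `$xml` would come from something like `ZipArchive::getFromName('word/document.xml')`. Paragraphs nested inside other paragraphs (for example in text boxes) would be counted twice by this simple query.

```php
<?php
// Sketch: pull paragraph text out of WordprocessingML with DOMXPath.
// docxXmlToText() is a hypothetical helper written for this example.

function docxXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Each run of text inside a paragraph sits in a w:t element.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . '<w:body><w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p></w:body></w:document>';

echo docxXmlToText($sample), "\n";
```

Compared with stripping tags from the raw XML, querying `w:t` elements keeps paragraph boundaries explicit and ignores markup that is not body text.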

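Maybe you could look at a library that can write to and read from Excel 2007 files using the OpenXML standard. That would give you an idea of the work involved in reading and writing OpenXML Word documents.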

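Would the .rtf format work for your purposes? .rtf can easily be converted to and from the .doc format, but it is written in plain text (with control commands embedded). This is how I plan to integrate my application with Word documents.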
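Because RTF is plain text with embedded control words, a basic document can be produced with string operations alone. The following is a minimal sketch under the assumption that the input is ASCII-only; `textToRtf()` is a hypothetical helper, and the font and size are arbitrary choices for the example.

```php
<?php
// Sketch: write plain text as a minimal RTF document that Word can open.
// textToRtf() is a hypothetical helper and assumes ASCII-only input.

function textToRtf(string $text): string
{
    // Backslash and braces are RTF control characters and must be escaped.
    // The backslash is handled first so later escapes are not doubled.
    $escaped = str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $text);

    // Normalize line endings, then turn each line break into a paragraph mark.
    $escaped = str_replace("\r\n", "\n", $escaped);
    $escaped = str_replace("\n", "\\par\n", $escaped);

    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"
        . "\\f0\\fs24 " . $escaped . "\\par\n}";
}

$rtf = textToRtf("Hello {world}\nSecond line");
file_put_contents(sys_get_temp_dir() . '/hello.rtf', $rtf);
echo $rtf, "\n";
```

Non-ASCII text would need additional escaping (RTF has `\u` escapes for that), which this sketch leaves out on purpose.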

phpLiveDocx is a Zend Framework component that can read and write DOC and DOCX files in PHP on Linux, Windows, and Mac.

See the project website.

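You can use Antiword, a free MS Word reader for Linux and most popular operating systems.

$document_file = 'c:\file.doc';
// escapeshellarg() protects against spaces and shell metacharacters in the path
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));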

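www.phplivedocx.org is a SOAP-based service, which means you always need to be online to test your files, and there are not enough usage examples either. Strangely, I only found out two days after downloading it (it also requires the Zend Framework) that it is a SOAP-based program (curse me!). I think that without COM this is simply not possible on a Linux server; the only idea is to convert the doc file into another usable format that PHP can parse.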

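I am working on the same kind of project [a word processor] myself, but I chose C#.NET and ASP.NET. From my research I learned the following:

By using the Open XML SDK and VSTO [Visual Studio Tools for Office]

we can easily work with Word files, manipulate them, and even convert them internally to different formats such as .odt, .pdf, .docx, and so on.

So go to msdn.microsoft.com and look carefully through the Office development tab. It is the easiest way to do this, because all the functions we need to implement are already available in .NET.

And since you want to do your project in PHP, you can do it in Visual Studio and .NET, as PHP is also a .NET-compatible language.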

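I am in the same situation. I think I am going to use a cheap 50 MB Windows hosting plan with a free domain to convert my files for my PHP server. Linking the two is easy: you only need to create an ASP.NET page that receives the doc file via POST and replies over HTTP, so a simple cURL call will do.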

Just updating the code:

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    // Open in binary mode and release the handle once the data is read.
    $fileHandle = fopen($userDoc, "rb");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        // Compare against the NUL byte itself; "== 0" behaves differently across PHP versions.
        if( $word_text[$i] === chr(0x00) )
        {
            $nulos++;
        }
        else
        {
            $nulos=0;
            $caracteres++;
        }

        if( $nulos>1996)
        {   
            break;  
        }
    }

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace the HYPERLINK field code with an HTML link
        $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
        $new_line = str_replace('\o' ,">",$new_line); 
        $new_line .= "\n";

        // image links
        $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
        $new_line = str_replace('\*' ,"><br>",$new_line); 
        $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 


        $outtext .= nl2br($new_line);
    }

 return $outtext;
} 

$userDoc = "Cultura.doc";
$text = parseWord($userDoc);

echo $text;


?>

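One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX. You can take a look at how it works. You can insert or extract content, and even merge multiple Word files into a single one.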

Use the following class directly to read a Word document:

class DocxConversion{
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    private function read_doc() {
        // Read the binary .doc file in one go and release the handle.
        $fileHandle = fopen($this->filename, "rb");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);

        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            // Keep only non-empty slices that contain no NUL byte.
            if ($thisline !== "" && strpos($thisline, chr(0x00)) === false) {
                $outtext .= $thisline . " ";
            }
        }
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }

    private function read_docx() {
        // A .docx file is a ZIP archive; the body text lives in word/document.xml.
        // ZipArchive replaces the procedural zip_* functions, which are deprecated as of PHP 8.
        $zip = new ZipArchive();
        if ($zip->open($this->filename) !== true) {
            return false;
        }

        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            return "";
        }

        // Separate table cells with a space and paragraphs with a line break.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

    /************************excel sheet************************************/

    private function xlsx_to_text($input_file) {
        $xml_filename = "xl/sharedStrings.xml"; // content file name
        $zip_handle = new ZipArchive();
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() is an instance method; calling it statically fails on PHP 8.
                // Entity substitution (LIBXML_NOENT) is left off for untrusted files.
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text = strip_tags($xml_handle->saveXML());
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    /*************************power point files*****************************/

    private function pptx_to_text($input_file) {
        $zip_handle = new ZipArchive();
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // loop through slide files
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() is an instance method; calling it statically fails on PHP 8.
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text .= strip_tags($xml_handle->saveXML());
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }


    public function convertToText() {

        if (!isset($this->filename) || !file_exists($this->filename)) {
            return "File Not exists";
        }

        $fileArray = pathinfo($this->filename);
        // The extension key is absent for files without one; compare case-insensitively.
        $file_ext  = isset($fileArray['extension']) ? strtolower($fileArray['extension']) : "";

        if ($file_ext == "doc") {
            return $this->read_doc();
        } elseif ($file_ext == "docx") {
            return $this->read_docx();
        } elseif ($file_ext == "xlsx") {
            // The helper expects the file path as an argument.
            return $this->xlsx_to_text($this->filename);
        } elseif ($file_ext == "pptx") {
            return $this->pptx_to_text($this->filename);
        }
        return "Invalid File Type";
    }

}

$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx 
echo $docText= $docObj->convertToText();