Reading images and text from a docx with PHP DOMDocument


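I have a Word docx file that contains a table with rows and columns. I have written some PHP code and, when I inspect the result after running it, the document comes back as XML. I want to print (printf) the text and the images on my PHP page.

This is what the output looks like when the PHP file reads the Word text with DOMDocument:

This is my Word file: we.tl/t-tGddnyasKj

The PHP code so far: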

<?php  

    #extract.php
   function pre( $data=false, $header=false, $tag='h1' ){
        $title = $header ? sprintf('<'.$tag.'>%s</'.$tag.'>',$header) : '';
        printf('%s<pre>%s</pre>',$title,print_r($data,1));
    }


    $document = 'sample.docx';


    function process_word_docx( $filename ){
        $zip = new ZipArchive;
        if( true === $zip->open( $filename ) ) {
            for( $i=0; $i < $zip->numFiles; $i++ ) {
                $obj=(object)$zip->statIndex( $i );
                if( $obj->name=='word/document.xml' ){
                    $xml=$zip->getFromIndex( $i );

                    libxml_use_internal_errors( true );
                    $dom=new DOMDocument('1.0','utf-8');
                    $dom->validateOnParse=false;
                    $dom->recover=true;
                    $dom->strictErrorChecking=false;
                    $dom->loadXML( $xml );
                    libxml_clear_errors();

                    $xp=new DOMXPath( $dom );
                    $xp->registerNamespace('ve','http://schemas.openxmlformats.org/markup-compatibility/2006');
                    $xp->registerNamespace('r','http://schemas.openxmlformats.org/officeDocument/2006/relationships');
                    $xp->registerNamespace('m','http://schemas.openxmlformats.org/officeDocument/2006/math');
                    $xp->registerNamespace('wp','http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing');
                    $xp->registerNamespace('w','http://schemas.openxmlformats.org/wordprocessingml/2006/main');

                    pre( $xml );

                }
            }
        }
    }
    process_word_docx( $document );


?>

How do I print the text and images on the PHP page?

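Following yesterday's extended discussion I carried on tinkering with the function and arrived at the following; you should be able to extract the information you need from it. You stated that you need to save the images and write each image's name and the other text to a database: the resulting array holds the relevant content.

I found that loading the docx file into a DOMDocument instance like this causes the DOM parser to do some strange things with the namespaced attributes: they appear in lowercase but cannot be queried in lowercase unless the entire XML string is converted to lowercase first. The alternative is to painstakingly find the correct case for each of the tags used in the XPath queries.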
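The case problem described above is consistent with XPath name tests being case-sensitive. Here is a minimal sketch of the "correct case" alternative, assuming the picture elements use the usual OOXML casing (`pic:nvPicPr`, `pic:cNvPr`). The XML string is a hypothetical, heavily trimmed stand-in for `word/document.xml`, not the asker's file.

```php
<?php
// Hypothetical, trimmed stand-in for word/document.xml (assumption: real files nest
// pic:pic deeper, inside w:drawing / a:graphic, but the name casing is the same).
$xml = <<<XML
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
  <w:body>
    <w:p><w:r><pic:pic><pic:nvPicPr><pic:cNvPr id="1" name="Picture 1"/></pic:nvPicPr></pic:pic></w:r></w:p>
  </w:body>
</w:document>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);   // note: no strtolower() here

$xp = new DOMXPath($dom);
$xp->registerNamespace('pic', 'http://schemas.openxmlformats.org/drawingml/2006/picture');

// The same path written twice: once in the document's own case, once in lowercase.
$exact = $xp->query('//pic:nvPicPr/pic:cNvPr');
$lower = $xp->query('//pic:nvpicpr/pic:cnvpr');

$exactCount  = $exact->length;
$lowerCount  = $lower->length;
$pictureName = $exactCount > 0 ? $exact->item(0)->getAttribute('name') : null;

printf("exact-case matches: %d, lowercase matches: %d\n", $exactCount, $lowerCount);
```

Querying in the original case avoids a side effect of `strtolower()`: it lowercases the cell text and the namespace URIs inside the XML as well as the tag names.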

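Hi, thanks, I can see this clearly now. When I run the code above it does not generate the images in word\media\.. Did you notice that in the code?

It should give the same names in the array as when the images were generated, so there is a link between them. This can form the basis of a script that saves the images (see the previous answer for how to save them); the difference in names is down to Word.

I'm not sure why I need to use the code from the old script to generate the images. And how do I correlate them, when the saved images have different names from the ones fetched? If this script could generate the images and associate them, I would be set. We need to associate the generated images with the names in the array; please guide me the rest of the way.

Word seems to assign generic names to the images; I don't know why or how that works. The parent docx file contains several files, one of which describes the relationships between the elements in the main xml/word document. Examining each one is no small task, but once the relationships are understood it should be easy.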
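The relationship file mentioned above can be used to tie a picture's display name to its file under `word/media/`. The following is a rough sketch, assuming the relationships live in `word/_rels/document.xml.rels`, that each picture's `a:blip` carries an `r:embed` id, and that targets are relative to `word/`. The function names and both XML strings are hypothetical illustrations, not part of the original answer.

```php
<?php
// Hypothetical, trimmed stand-in for word/_rels/document.xml.rels
$relsXml = <<<XML
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.jpeg"/>
</Relationships>
XML;

// Hypothetical, trimmed stand-in for word/document.xml
$documentXml = <<<XML
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
            xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:body><w:p><w:r>
    <pic:pic>
      <pic:nvPicPr><pic:cNvPr id="1" name="Picture 1"/></pic:nvPicPr>
      <pic:blipFill><a:blip r:embed="rId4"/></pic:blipFill>
    </pic:pic>
  </w:r></w:p></w:body>
</w:document>
XML;

/* Map relationship ids (rId4, ...) to paths inside the archive. */
function relationship_targets(string $relsXml): array {
    $dom = new DOMDocument();
    $dom->loadXML($relsXml);
    $xp = new DOMXPath($dom);
    $xp->registerNamespace('rel', 'http://schemas.openxmlformats.org/package/2006/relationships');

    $targets = [];
    foreach ($xp->query('//rel:Relationship') as $rel) {
        // Assumption: Target is relative to the word/ folder.
        $targets[$rel->getAttribute('Id')] = 'word/' . $rel->getAttribute('Target');
    }
    return $targets;
}

/* List every picture with its display name and the archive file it points at. */
function picture_files(string $documentXml, array $targets): array {
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);
    $xp = new DOMXPath($dom);
    $xp->registerNamespace('pic', 'http://schemas.openxmlformats.org/drawingml/2006/picture');
    $xp->registerNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main');
    $relNs = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

    $pictures = [];
    foreach ($xp->query('//pic:pic') as $pic) {
        $name = $xp->evaluate('string(pic:nvPicPr/pic:cNvPr/@name)', $pic);
        $blip = $xp->query('pic:blipFill/a:blip', $pic)->item(0);
        $rid  = $blip ? $blip->getAttributeNS($relNs, 'embed') : '';
        $pictures[] = ['name' => $name, 'file' => $targets[$rid] ?? null];
    }
    return $pictures;
}

$targets  = relationship_targets($relsXml);
$pictures = picture_files($documentXml, $targets);
print_r($pictures);
```

With a real file, both strings could be read with `$zip->getFromName('word/_rels/document.xml.rels')` and `$zip->getFromName('word/document.xml')`. Because the document is not lowercased here, the queries use the original element case.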

<?php   

    define('br','<br />');

    #extract.php
    function pre( $data=false, $header=false, $tag='h1' ){
        $title = $header ? sprintf('<'.$tag.'>%s</'.$tag.'>',$header) : '';
        printf('%s<pre>%s</pre>',$title,print_r($data,1)); 
    }


    $document = 'sample.docx';



    function getparent( $n, $tag ){
        while( $n && $n->nodeType==XML_ELEMENT_NODE && $n->tagName!=$tag ){
            $n=$n->parentNode;
        }
        return $n;
    }


    function process_word_docx( $filename ){
        $data=[ 'start' => microtime( true ),'names'=>[] ];
        $paths=[];

        $zip=new ZipArchive;


        if( true === $zip->open( $filename ) ) {
            for( $i=0; $i < $zip->numFiles; $i++ ) {
                $obj=(object)$zip->statIndex( $i );

                if( $obj->name=='word/document.xml' ){

                    $xml=$zip->getFromIndex( $i );

                    $data['position']=$obj->index;
                    $data['xml-size']=$obj->size;
                    $data['created']=$obj->mtime;
                    $data['compression-method']=$obj->comp_method;



                    libxml_use_internal_errors( true );
                    $dom=new DOMDocument('1.0','utf-8');
                    $dom->validateOnParse=false;
                    $dom->recover=true;
                    $dom->strictErrorChecking=false;
                    $dom->loadXML( strtolower( $xml ) );
                    libxml_clear_errors();



                    $xp=new DOMXPath( $dom );
                    /* none of these namespace uris exist */
                    $xp->registerNamespace('ve','http://schemas.openxmlformats.org/markup-compatibility/2006');
                    $xp->registerNamespace('r','http://schemas.openxmlformats.org/officeDocument/2006/relationships');
                    $xp->registerNamespace('m','http://schemas.openxmlformats.org/officeDocument/2006/math');
                    $xp->registerNamespace('wp','http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing');
                    $xp->registerNamespace('w','http://schemas.openxmlformats.org/wordprocessingml/2006/main');
                    $xp->registerNamespace('pic','http://schemas.openxmlformats.org/drawingml/2006/picture');
                    $xp->registerNamespace('a','http://schemas.openxmlformats.org/drawingml/2006/main');
                    /* this/these exist */
                    $xp->registerNamespace('wne','http://schemas.microsoft.com/office/word/2006/wordml');


                    /* find tables */
                    $col=$xp->query( '//w:tbl//w:tr' );
                    if( $col->length > 0 ){
                        foreach( $col as $row => $tr ){

                            /* Count the cells on each row */
                            $expr='count( w:tc )';
                            $cellcount=$xp->evaluate( $expr, $tr );

                            if( $cellcount > 0 ){
                                /* find all the table cells for this row */
                                $expr='w:tc';
                                $cells=$xp->query( $expr, $tr );

                                /* Are there any images? Note: the leading "//" makes this a document-wide count, not a per-row one */
                                $expr='count(//pic:cnvpr)';
                                $qty=$xp->evaluate( $expr, $tr );


                                /* There are images */
                                if( $qty > 0 ){

                                    $expr='w:tc//w:drawing//a:graphic//pic:pic//pic:nvpicpr/pic:cnvpr';
                                    $wpcol=$xp->query( $expr, $tr );

                                    if( $wpcol->length > 0 ){
                                        foreach( $wpcol as $index=> $node ){
                                            /* navigate up the DOM tree until we find the table cell tag */
                                            $oCell = getparent( $node, 'w:tc' );

                                            /* Find the name of the image */
                                            $name = $node->getAttribute('name');
                                            $data['names'][]=$name;

                                            /* get the text in the current row */
                                            $text = ucfirst( $tr->textContent ) ?: ' - EMPTY -';

                                            /* find the cell index for the row; a separate variable avoids clobbering the outer $index */
                                            $column = -1; /* stays -1 if the cell is not a direct child of this row */
                                            foreach( $cells as $pos => $cell ){
                                                if( $cell->isSameNode( $oCell ) ){
                                                    $column = $pos;
                                                    break;
                                                }
                                            }

                                            /* prepare payload */
                                            if( $wpcol->length==1 ){
                                                $data['statistics'][ $row ]=array(
                                                    'text'      =>  $text,
                                                    'name'      =>  $name,
                                                    'column'    =>  $column
                                                );
                                            } else {
                                                /* multiple images on this row - multiple images can be within a single cell */
                                                $data['statistics'][ $row ][ $column ][]=array(
                                                    'text'      =>  $text,
                                                    'name'      =>  $name,
                                                    'column'    =>  $column
                                                );
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }else{
                    if( preg_match( '([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)', $obj->name ) ) {
                        $paths[ $obj->name ]=base64_encode( $zip->getFromIndex( $i ) );
                    }
                }
            }
        }
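        /* finalise statistics */
        /* count the images actually recorded; $qty is undefined when no table row was processed */
        $data['count']=count( $data['names'] );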
        $data['end']=microtime( true );
        $data['time']=$data['end'] - $data['start'];
        $data['total-size']=filesize( $filename );
        $data['paths']=$paths;

        /* return payload */
        return $data;
    }



    $data=process_word_docx( $document );
    pre( $data );

?>
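The function above returns the images base64-encoded in `$data['paths']`, keyed by their path inside the archive. Here is a minimal sketch of how one such entry might be shown on a page or written to disk. `image_tag()` and `save_image()` are hypothetical helpers, and the sample `$paths` array uses fake bytes rather than a real image.

```php
<?php
/* Build an <img> tag with a data URI from one entry of $data['paths']. */
function image_tag(string $zipPath, string $base64, string $alt = ''): string {
    $types = [
        'jpg' => 'image/jpeg', 'jpeg' => 'image/jpeg', 'png' => 'image/png',
        'gif' => 'image/gif',  'bmp'  => 'image/bmp',
    ];
    $ext  = strtolower(pathinfo($zipPath, PATHINFO_EXTENSION));
    $mime = $types[$ext] ?? 'application/octet-stream';

    return sprintf(
        '<img src="data:%s;base64,%s" alt="%s" />',
        $mime,
        $base64,
        htmlspecialchars($alt, ENT_QUOTES)
    );
}

/* Decode one entry and write it to $dir, keeping the file name used inside the docx. */
function save_image(string $zipPath, string $base64, string $dir): string {
    $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . basename($zipPath);
    file_put_contents($file, base64_decode($base64));
    return $file;
}

// Hypothetical sample shaped like $data['paths'] (fake bytes, not a valid PNG).
$paths = ['word/media/image1.png' => base64_encode("\x89PNG fake bytes")];

foreach ($paths as $zipPath => $base64) {
    $html  = image_tag($zipPath, $base64, basename($zipPath));
    $saved = save_image($zipPath, $base64, sys_get_temp_dir());
    echo $html, PHP_EOL;
}
```

Saving under the archive's own file name (`image1.png` and so on) keeps the file on disk traceable to the entry in the docx, which is the name a database row would need to store.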