Unable to extract mail merge fields from a Word DOCX using PHP and DOM

php xml ms-word

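I'm trying to come up with a solution that lets users upload a mail-merge-enabled Word DOCX template. Ideally, the system would read the DOCX file, extract the XML, find the mail merge fields, and save them to a database for mapping later on. I may end up using a SOAP service such as Zend LiveDocX, or PHPDOCX, or something else entirely, but for now I need to work out how to identify the fields in a DOCX file. To do that, I started from this article: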

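I've tweaked it a bit to suit my needs, which may be part of the problem, although I get the same error with the original code as well. Specifically, I'm not using it to perform a mail merge right now; I only want to identify the fields. Here's what I've got: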

    $newFile = '/var/www/mysite.com/public_html/template.docx';

    $zip = new ZipArchive();
    if( $zip->open( $newFile, ZIPARCHIVE::CHECKCONS ) !== TRUE ) { echo 'failed to open template'; exit; }
    $file = 'word/document.xml';
    $data = $zip->getFromName( $file );
    $zip->close();

    $doc = new DOMDocument();
    $doc->loadXML( $data );
    $wts = $doc->getElementsByTagNameNS('http://schemas.openxmlformats.org/wordprocessingml/2006/main', 'fldChar');

    $mergefields = array();

    function getMailMerge(&$wts, $index) {
        $loop = true;
        $counter = $index;
        $startfield = false;
        while ($loop) {
            if ($wts->item($counter)->attributes->item(0)->nodeName == 'w:fldCharType') {
                $nodeName = '';
                $nodeValue = '';
                switch ($wts->item($counter)->attributes->item(0)->nodeValue) {
                    case 'begin':
                        if ($startfield) {
                            $counter = getMailMerge($wts, $counter);
                        }
                        $startfield = true;
                        if ($wts->item($counter)->parentNode->nextSibling) {
                            $nodeName = $wts->item($counter)->parentNode->nextSibling->childNodes->item(1)->nodeName;
                            $nodeValue = $wts->item($counter)->parentNode->nextSibling->childNodes->item(1)->nodeValue;
                        }
                        else {
                            // No sibling
                            // check next node
                            $nodeName = $wts->item($counter + 1)->parentNode->previousSibling->childNodes->item(1)->nodeName;
                            $nodeValue = $wts->item($counter + 1)->parentNode->previousSibling->childNodes->item(1)->nodeValue;
                        }
                        if (substr($nodeValue, 0, 11) == ' MERGEFIELD') {
                            $mergefields[] = strtolower(str_replace('"', '', trim(substr($nodeValue, 12))));
                        }
                        $counter++;
                        break;
                    case 'separate':
                        $counter++;
                        break;
                    case 'end':
                        if ($startfield) {
                            $startfield = false;
                        }
                        $loop = false;
                }
            }
        }
        return $counter;
    }

    for ($x = 0; $x < $wts->length; $x++) {
        if ($wts->item($x)->attributes->item(0)->nodeName == 'w:fldCharType' && $wts->item($x)->attributes->item(0)->nodeValue == 'begin') {
            $newcount = getMailMerge($wts, $x);
            $x = $newcount;
        }
    }

Google has failed me in trying to figure out this error. Can anyone point me in the right direction? Thanks in advance.

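Found a solution. It's not as elegant as I had hoped, but here goes.

Using xml_parser_create_ns(), I can search the DOCX file for the key I need, specifically HTTP://SCHEMAS.OPENXMLFORMATS.ORG/WORDPROCESSINGML/2006/MAIN:INSTRTEXT, which identifies all the fields marked as MERGEFIELD. I can then dump the results into an array and use them to update the database. To wit: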

    // Word file to be opened
    $newFile = '/var/www/mysite.com/public_html/template.docx';

    // Extract the document.xml file from the DOCX archive
    $zip = new ZipArchive();
    if( $zip->open( $newFile, ZIPARCHIVE::CHECKCONS ) !== TRUE ) { echo 'failed to open template'; exit; }
    $file = 'word/document.xml';
    $data = $zip->getFromName( $file );
    $zip->close();

    // Create the XML parser and create an array of the results
    $parser = xml_parser_create_ns();
    xml_parse_into_struct($parser, $data, $vals, $index);
    xml_parser_free($parser);

    // Look up the index entry for w:instrText (the parser upper-cases names by default)
    $key = 'HTTP://SCHEMAS.OPENXMLFORMATS.ORG/WORDPROCESSINGML/2006/MAIN:INSTRTEXT';
    $found = isset($index[$key]) ? $index[$key] : array();

    // Cycle *that* array looking for "MERGEFIELD" and grab the field name to yet another array
    // Make sure to check for duplicates since fields may be re-used
    $mergefields = array();
    foreach ($found as $field) {
        $value = isset($vals[$field]['value']) ? $vals[$field]['value'] : '';
        if (substr($value, 0, 11) !== ' MERGEFIELD') {
            continue; // some other field instruction, not a merge field
        }
        $name = strtolower(trim(substr($value, 12)));
        if (!in_array($name, $mergefields)) {
            $mergefields[] = $name;
        }
    }

    // View the fruits of your labor
    print_r($mergefields);
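
Since the question asked about DOM specifically, the same lookup can probably also be done with DOMXPath. The following is only a minimal sketch under some assumptions: extractMergeFields() is a hypothetical helper name, the sample XML is made up for illustration, and each w:instrText element is treated on its own, so an instruction that Word has split across several runs would be missed. It also checks the w:instr attribute of w:fldSimple, which is the other place WordprocessingML can store a field instruction.

```php
<?php
// Hypothetical helper: return the lower-cased, de-duplicated MERGEFIELD names
// found in a WordprocessingML string (e.g. the contents of word/document.xml).
function extractMergeFields(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    // Complex fields keep the instruction in w:instrText,
    // simple fields keep it in the w:instr attribute of w:fldSimple.
    $instructions = [];
    foreach ($xpath->query('//w:instrText') as $node) {
        $instructions[] = $node->nodeValue;
    }
    foreach ($xpath->query('//w:fldSimple/@w:instr') as $attr) {
        $instructions[] = $attr->nodeValue;
    }

    $fields = [];
    foreach ($instructions as $instruction) {
        // Accepts both: MERGEFIELD Name  and  MERGEFIELD "Name With Spaces"
        if (preg_match('/^\s*MERGEFIELD\s+(?:"([^"]+)"|(\S+))/i', $instruction, $m)) {
            $name = strtolower($m[1] !== '' ? $m[1] : $m[2]);
            if (!in_array($name, $fields, true)) {
                $fields[] = $name;
            }
        }
    }
    return $fields;
}

// Made-up sample input: one complex field and one simple field.
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body><w:p>'
    . '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    . '<w:r><w:instrText xml:space="preserve"> MERGEFIELD FirstName \* MERGEFORMAT </w:instrText></w:r>'
    . '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    . '<w:fldSimple w:instr=" MERGEFIELD &quot;Last Name&quot; "/>'
    . '</w:p></w:body></w:document>';

print_r(extractMergeFields($sample));
```

In real use, the string passed in would be the $data read from the archive with ZipArchive::getFromName(). Matching the name with a regular expression rather than a fixed substr() offset also keeps trailing switches such as \* MERGEFORMAT out of the stored field name.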

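Using the same script, I found that the problem must be the number of child nodes it walks through. Reading the name directly from the sibling works: $nodeName = $wts->item($counter)->parentNode->nextSibling->nodeName;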
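
Building on that idea, one way to avoid guessing a child index at all is to walk the sibling runs after the "begin" marker and collect whatever w:instrText they contain. This is a rough sketch rather than a drop-in fix: readFieldInstruction() and W_NS are hypothetical names, the sample paragraph is made up, and nested fields are not handled.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: given a w:fldChar "begin" element, join the text of every
// w:instrText in the runs that follow it, stopping at the first "separate" or
// "end" marker. Nested fields are not handled in this sketch.
function readFieldInstruction(DOMElement $begin): string
{
    $instruction = '';
    // $begin->parentNode is the enclosing w:r run; walk its following siblings.
    for ($run = $begin->parentNode->nextSibling; $run !== null; $run = $run->nextSibling) {
        if (!($run instanceof DOMElement)) {
            continue; // skip text nodes (whitespace) between runs
        }
        foreach ($run->getElementsByTagNameNS(W_NS, 'fldChar') as $marker) {
            $type = $marker->getAttributeNS(W_NS, 'fldCharType');
            if ($type === 'separate' || $type === 'end') {
                return trim($instruction);
            }
        }
        foreach ($run->getElementsByTagNameNS(W_NS, 'instrText') as $text) {
            $instruction .= $text->nodeValue;
        }
    }
    return trim($instruction);
}

// Made-up sample: the instruction is split across two runs.
$doc = new DOMDocument();
$doc->loadXML('<w:p xmlns:w="' . W_NS . '">'
    . '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    . '<w:r><w:instrText> MERGEFIELD First</w:instrText></w:r>'
    . '<w:r><w:instrText>Name </w:instrText></w:r>'
    . '<w:r><w:fldChar w:fldCharType="separate"/></w:r>'
    . '<w:r><w:t>John</w:t></w:r>'
    . '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    . '</w:p>');

$begin = $doc->getElementsByTagNameNS(W_NS, 'fldChar')->item(0);
echo readFieldInstruction($begin), "\n"; // prints "MERGEFIELD FirstName"
```

Reading the marker type with getAttributeNS() instead of attributes->item(0) also removes the assumption that w:fldCharType is always the first attribute on the element.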
