Reading MS Word files with PHP
When I read an MS Word file, I first open it as a zip archive and then extract its XML. However, this strips out the newline characters, which is really bothering me. What should I do? I use the following code:
function get_docx_content($filename) {
    // Check for extension (pathinfo avoids passing a temporary to end())
    $ext = pathinfo($filename, PATHINFO_EXTENSION);
    // If it's a docx file
    if ($ext == 'docx')
        $dataFile = "word/document.xml";
    // Otherwise it must be an odt file
    else
        $dataFile = "content.xml";
    // Create a new ZIP archive object
    $zip = new ZipArchive;
    $content = null;
    // Open the archive file
    if (true === $zip->open($filename)) {
        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it to a string
            $text = $zip->getFromIndex($index);
            // Load XML from a string (loadXML must be called on an instance)
            // Ignore errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Remove XML formatting tags and keep only the text
            $content = strip_tags($xml->saveXML());
        }
        // Close the archive file (previously skipped by the early return)
        $zip->close();
    }
    return $content;
}
Try this, in case it helps:
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
    // Read the whole file in binary mode, then release the handle
    $fileHandle = fopen($userDoc, "rb");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline)
    {
        // Keep only non-empty slices that contain no NUL byte
        $pos = strpos($thisline, chr(0x00));
        if (($pos === FALSE) && (strlen($thisline) > 0))
        {
            $outtext .= $thisline." ";
        }
    }
    // Drop every character outside a small ASCII whitelist
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
If you want to learn more, it is worth looking into further.
No, unfortunately it produced a mess of hundreds of unknown characters, while my document contains only 12 meaningful words.
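That garbage is expected for a .docx: the file is a zip archive, so scanning its raw bytes mostly hits compressed data. Staying with the zip/XML route from the question, the newlines most likely vanish because strip_tags() throws away the <w:p> paragraph elements, and those elements (not literal newline characters) are what mark line breaks in word/document.xml. Below is a minimal sketch of one way around that, not a definitive implementation. The helper name docx_xml_to_text is made up for illustration, and it assumes the usual WordprocessingML namespace URI; ODT's content.xml uses different element names and is not handled here.

```php
<?php
// Hypothetical helper (not from the original post): turns the raw
// word/document.xml string into plain text, one line per paragraph.
function docx_xml_to_text(string $xmlString): string
{
    $dom = new DOMDocument();
    // Suppress parser errors and warnings, as the question's code does
    if (!$dom->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    // Assumption: the document uses the common (transitional) namespace
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $lines = [];
    // Each <w:p> is one paragraph, so it becomes one output line
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        // Visit text runs, tabs, and manual line breaks in document order
        foreach ($xpath->query('.//w:t | .//w:tab | .//w:br', $paragraph) as $node) {
            if ($node->localName === 't') {
                $line .= $node->textContent;
            } elseif ($node->localName === 'tab') {
                $line .= "\t";
            } else {
                $line .= "\n"; // <w:br/> is a line break inside a paragraph
            }
        }
        $lines[] = $line;
    }

    return implode("\n", $lines);
}

// Small inline sample standing in for a real word/document.xml
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
    . '<w:p><w:r><w:t>First line</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second</w:t></w:r><w:r><w:t xml:space="preserve"> line</w:t></w:r></w:p>'
    . '</w:body></w:document>';

echo docx_xml_to_text($sample), "\n";
```

In get_docx_content() from the question, this would mean passing the string returned by $zip->getFromIndex($index) to docx_xml_to_text() instead of running strip_tags() over the re-serialized XML. Paragraphs nested inside text boxes would be emitted twice by this sketch, so it is only a starting point.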