C#: Export the first header and footer of a docx/doc as a docx file using OpenXML

c# html ms-word

I would like to ask how to convert the header/footer parts of an MS Word document (doc/docx) to HTML. I open the document as follows:

using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))

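a.k.a. OpenXML.

I am using WmlToHtmlConverter to convert the documents, and it converts them very well, except that the headers and footers are skipped, because the HTML standard doesn't support pagination. I was wondering how I can get them and extract them as HTML. I tried getting them like this: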

using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mainFileMemoryStream, true))
{
    Document mainPart = wdDoc.MainDocumentPart.Document;
    DocumentFormat.OpenXml.Packaging.HeaderPart firstHeader =
            wdDoc.MainDocumentPart.HeaderParts.FirstOrDefault();

    if (firstHeader != null)
    {
        using (var headerStream = firstHeader.GetStream())
        {
            return headerStream.ReadFully();
        }
    }
    return null;
}

and then passing the result to the conversion function, but I get an exception that says:

File contains corrupted data, with the stack trace:

at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
at DocxToHTML.Converter.HTMLConverter.ParseDOCX(Byte[] fileInfo, String fileName) in D:\eTemida\eTemida.Web\DocxToHTML.Converter\HTMLConverter.cs:line 96

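Any help will be appreciated.

After a hard struggle, I arrived at the following solution:

I wrote a function that converts the byte array of a docx document to HTML, as follows: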

public string ConvertToHtml(byte[] fileInfo, string fileName = "Default.docx")
    {
        if (string.IsNullOrEmpty(fileName) || Path.GetExtension(fileName) != ".docx")
            return "Unsupported format";

        //FileInfo fileInfo = new FileInfo(fullFilePath);

        string htmlText = string.Empty;
        try
        {
            htmlText = ParseDOCX(fileInfo, fileName);
        }
        catch (OpenXmlPackageException e)
        {

            if (e.ToString().Contains("Invalid Hyperlink"))
            {
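                // Use an expandable stream so the fixer can rewrite the package,
                // then parse the repaired bytes rather than the original array.
                using (MemoryStream fs = new MemoryStream())
                {
                    fs.Write(fileInfo, 0, fileInfo.Length);
                    UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
                    htmlText = ParseDOCX(fs.ToArray(), fileName);
                }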
            }
        }
        return htmlText;
    }

Here ParseDOCX does all of the conversion. The code of ParseDOCX:

private string ParseDOCX(byte[] fileInfo, string fileName)
    {
        try
        {
            //byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(fileInfo, 0, fileInfo.Length);

                using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
                {

                    int imageCounter = 0;

                    var pageTitle = fileName;
                    var part = wDoc.CoreFilePropertiesPart;
                    if (part != null)
                        pageTitle = (string)part.GetXDocument().Descendants(DC.title).FirstOrDefault() ?? fileName;

                    WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                    {
                        AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                        PageTitle = pageTitle,
                        FabricateCssClasses = true,
                        CssClassPrefix = "pt-",
                        RestrictToSupportedLanguages = false,
                        RestrictToSupportedNumberingFormats = false,
                        ImageHandler = imageInfo =>
                        {
                            ++imageCounter;
                            string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                            ImageFormat imageFormat = null;
                            if (extension == "png") imageFormat = ImageFormat.Png;
                            else if (extension == "gif") imageFormat = ImageFormat.Gif;
                            else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                            else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                            else if (extension == "tiff")
                            {
                                extension = "gif";
                                imageFormat = ImageFormat.Gif;
                            }
                            else if (extension == "x-wmf")
                            {
                                extension = "wmf";
                                imageFormat = ImageFormat.Wmf;
                            }

                            if (imageFormat == null)
                                return null;

                            string base64 = null;
                            try
                            {
                                using (MemoryStream ms = new MemoryStream())
                                {
                                    imageInfo.Bitmap.Save(ms, imageFormat);
                                    var ba = ms.ToArray();
                                    base64 = System.Convert.ToBase64String(ba);
                                }
                            }
                            catch (System.Runtime.InteropServices.ExternalException)
                            { return null; }


                            ImageFormat format = imageInfo.Bitmap.RawFormat;
                            ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders().First(c => c.FormatID == format.Guid);
                            string mimeType = codec.MimeType;

                            string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64);

                            XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                            return img;
                        }

                    };
                    XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);

                    var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement);
                    var htmlString = html.ToString(SaveOptions.DisableFormatting);
                    return htmlString;
                }
            }
        }
        catch (Exception)
        {
            return "File contains corrupt data";
        }
    }

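So far everything looked nice and easy, but then I realized that the document's headers and footers were simply being skipped, so I had to convert them somehow. I tried using the GetStream() method of HeaderPart, but of course an exception was thrown, because the header tree is not the same as the document's.

Then I decided to extract the header and footer as new documents (I had a hard time with this one) using OpenXML's

WordprocessingDocument headerDoc = WordprocessingDocument.Create(headerStream, Document)

but unfortunately converting that document also failed, because this just creates a plain docx document without any settings, styles, webSettings, etc. This took a lot of time to figure out.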
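To make the first failure concrete: the stream returned by a HeaderPart holds only that part's XML, not a ZIP package, which is why handing its bytes to WordprocessingDocument.Open fails. The following is a minimal sketch, assuming the DocumentFormat.OpenXml package is referenced; the class and method names (`HeaderInspector`, `ReadFirstHeaderXml`) are hypothetical and not part of the original answer.

```csharp
using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class HeaderInspector
{
    // Hypothetical helper: returns the raw XML of the first header part,
    // or null if the document has no header parts.
    public static XDocument ReadFirstHeaderXml(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, false))
        {
            HeaderPart firstHeader = doc.MainDocumentPart.HeaderParts.FirstOrDefault();
            if (firstHeader == null)
                return null;

            // The part stream is plain XML rooted at the header element, not a
            // .docx package, so it loads as XML but cannot be opened as a document.
            using (Stream headerStream = firstHeader.GetStream())
            {
                return XDocument.Load(headerStream);
            }
        }
    }
}
```

Inspecting the returned XDocument shows the header markup on its own, without the styles, settings, and relationships that a full package carries.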

So in the end I decided to create the new document with Cathal's DocX library, and that finally worked. The code is as follows:

public string ConvertHeaderToHtml(HeaderPart header)
    {

        using (MemoryStream headerStream = new MemoryStream())
        {
            //Cathal's Docx Create
            var newDocument = Novacode.DocX.Create(headerStream);
            newDocument.Save();

            using (WordprocessingDocument headerDoc = WordprocessingDocument.Open(headerStream,true))
            {
                var headerParagraphs = new List<OpenXmlElement>(header.Header.Elements());
                var mainPart = headerDoc.MainDocumentPart;

                // Cloning the elements is necessary: appending them directly throws an exception,
                // because they still belong to the header's tree
                mainPart.Document.Body.Append(headerParagraphs.Select(h => (OpenXmlElement)h.Clone()).ToList());

                // Copy the header's relationships into the document
                foreach (IdPartPair parts in header.Parts)
                {
                    // The second parameter of AddPart is very important: if it is not set, the relationship ID
                    // changes and the document's pictures, etc. won't show
                    mainPart.AddPart(parts.OpenXmlPart,parts.RelationshipId);
                }
                headerDoc.MainDocumentPart.Document.Save();
                headerDoc.Save();
                headerDoc.Close();
            }
            return ConvertToHtml(headerStream.ToArray());
        }
    }

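That's it for the header. I pass in the HeaderPart and then get its Header elements. I extract the relationships, which is very important if the header contains images, and import them into the new document itself; after that the document is ready for conversion.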
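The answer does not show how the HeaderPart is chosen. The question uses HeaderParts.FirstOrDefault(), which may not correspond to the header a section actually references. One possible way to pick a specific header is to follow the header references in the section properties. This is only a sketch under that assumption; `HeaderLookup` and `FindHeader` are hypothetical names.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class HeaderLookup
{
    // Hypothetical helper: scans section properties in document order and
    // returns the part behind the first header reference of the requested
    // type (for example HeaderFooterValues.Default or HeaderFooterValues.First).
    public static HeaderPart FindHeader(MainDocumentPart mainPart, HeaderFooterValues type)
    {
        HeaderReference reference = mainPart.Document.Body
            .Descendants<SectionProperties>()
            .SelectMany(section => section.Elements<HeaderReference>())
            .FirstOrDefault(r => r.Type != null && r.Type.Value == type);

        if (reference == null || reference.Id == null)
            return null;

        // The reference's Id is the relationship ID of the header part.
        return (HeaderPart)mainPart.GetPartById(reference.Id.Value);
    }
}
```

The result, when not null, could then be passed to ConvertHeaderToHtml.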

Use the same steps to generate HTML from the footer.
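The footer code is not given in the answer. A sketch of what the same steps might look like for a FooterPart follows, mirroring ConvertHeaderToHtml. It assumes the same Novacode DocX and DocumentFormat.OpenXml packages; the class name `FooterExporter` and the `convertToHtml` delegate parameter (standing in for the ConvertToHtml method above) are assumptions made to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public static class FooterExporter
{
    // Hypothetical footer counterpart of ConvertHeaderToHtml.
    public static string ConvertFooterToHtml(FooterPart footer, Func<byte[], string> convertToHtml)
    {
        using (var footerStream = new MemoryStream())
        {
            // Create the empty target document with the DocX library, as the
            // answer does for headers.
            var newDocument = Novacode.DocX.Create(footerStream);
            newDocument.Save();

            using (WordprocessingDocument footerDoc = WordprocessingDocument.Open(footerStream, true))
            {
                MainDocumentPart mainPart = footerDoc.MainDocumentPart;

                // Clone each footer element, since the originals still belong
                // to the footer's tree.
                List<OpenXmlElement> clones = footer.Footer.Elements()
                    .Select(e => (OpenXmlElement)e.CloneNode(true))
                    .ToList();
                mainPart.Document.Body.Append(clones);

                // Reuse the original relationship IDs so image references
                // in the copied markup still resolve.
                foreach (IdPartPair pair in footer.Parts)
                    mainPart.AddPart(pair.OpenXmlPart, pair.RelationshipId);

                mainPart.Document.Save();
            }
            return convertToHtml(footerStream.ToArray());
        }
    }
}
```

From inside the converter class it might be called as FooterExporter.ConvertFooterToHtml(footerPart, bytes => ConvertToHtml(bytes)).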

Hope this helps someone in their work.

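Hi, there is no direct way in OpenXML (i.e. in OpenXML PowerTools) to get the header and footer as HTML. Instead, you have to read the header and footer content as text and then apply styles to that text yourself. Please refer to: I have code that creates a Word document from 3 HTML strings (HtmlBody, HtmlHeader, HtmlFooter). There are some cornerstones in there as well; if needed, I will try to upload it.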
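The text-based approach described in this answer could look roughly like the sketch below. It only extracts paragraph text and wraps it in p elements; no styling is carried over, so styles would still have to be applied separately. The names `HeaderTextExporter` and `HeaderTextToHtml` are hypothetical.

```csharp
using System.Net;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class HeaderTextExporter
{
    // Hypothetical helper: emits one <p> element per header paragraph,
    // containing only that paragraph's HTML-encoded text.
    public static string HeaderTextToHtml(HeaderPart header)
    {
        var html = new StringBuilder();
        foreach (Paragraph paragraph in header.Header.Descendants<Paragraph>())
        {
            html.Append("<p>")
                .Append(WebUtility.HtmlEncode(paragraph.InnerText))
                .Append("</p>");
        }
        return html.ToString();
    }
}
```

Unlike the DocX-based approach in the first answer, this drops images and formatting, which is the trade-off the answer alludes to.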