C#: Export the first header and footer of a docx/doc to a docx file using OpenXML
I would like to ask how to convert the header/footer part of an MS Word document (doc/docx) to HTML. I open the document as follows:
using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
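a.k.a. OpenXML.
I am using WmlToHtmlConverter to convert the document. It converts the document very well, except that the headers and footers are skipped, because the HTML standard does not support pagination. I would like to know how to get them and extract them as HTML.
I tried to get them like this: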
using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mainFileMemoryStream, true))
{
    Document mainPart = wdDoc.MainDocumentPart.Document;
    DocumentFormat.OpenXml.Packaging.HeaderPart firstHeader =
        wdDoc.MainDocumentPart.HeaderParts.FirstOrDefault();
    if (firstHeader != null)
    {
        using (var headerStream = firstHeader.GetStream())
        {
            return headerStream.ReadFully();
        }
    }
    return null;
}
Then I pass the result to the conversion function, but I get an exception that says:
File contains corrupted data, with this stack trace:
at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
at DocxToHTML.Converter.HTMLConverter.ParseDOCX(Byte[] fileInfo, String fileName) in D:\eTemida\eTemida.Web\DocxToHTML.Converter\HTMLConverter.cs:line 96
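Any help would be greatly appreciated.

After a lot of effort, I found the following solution. I created a function that converts the byte array of a docx document to HTML, as follows: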
public string ConvertToHtml(byte[] fileInfo, string fileName = "Default.docx")
{
    if (string.IsNullOrEmpty(fileName) || Path.GetExtension(fileName) != ".docx")
        return "Unsupported format";
    //FileInfo fileInfo = new FileInfo(fullFilePath);
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo, fileName);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (MemoryStream fs = new MemoryStream(fileInfo))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo, fileName);
        }
    }
    return htmlText;
}
Here, ParseDOCX does all of the conversion. The code of ParseDOCX:
private string ParseDOCX(byte[] fileInfo, string fileName)
{
    try
    {
        //byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(fileInfo, 0, fileInfo.Length);
            using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument().Descendants(DC.title).FirstOrDefault() ?? fileName;
                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    // Embeds every image in the HTML as a base64 data URI.
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }
                        if (imageFormat == null)
                            return null;
                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }
                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders().First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;
                        string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64);
                        XElement img = new XElement(Xhtml.img,
                            new XAttribute(NoNamespace.src, imageSource),
                            imageInfo.ImgStyleAttribute,
                            imageInfo.AltText != null ?
                                new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };
                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch (OpenXmlPackageException)
    {
        // Rethrow so that ConvertToHtml can detect "Invalid Hyperlink" and retry;
        // the general catch below would otherwise swallow this exception.
        throw;
    }
    catch (Exception)
    {
        return "File contains corrupt data";
    }
}
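So far everything looked fine and simple, but then I realized that the document's headers and footers were simply skipped, so I had to convert them somehow.
I tried using the GetStream() method of HeaderPart, but of course that threw an exception, because the header's tree is not the same as the document's: the part stream holds only the header's XML, not a complete docx package.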
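The "File contains corrupted data" error comes from the ZIP package constructor in the stack trace above: a docx is a ZIP archive, while a part stream is plain XML. As a rough diagnostic sketch, a byte array can be checked for the ZIP signature before it is handed to WordprocessingDocument.Open. The class and method names here (PackageSniffer, LooksLikeZipPackage) are hypothetical and not part of any library.

```csharp
using System;
using System.Text;

public static class PackageSniffer
{
    // Hypothetical helper: a ZIP archive (and therefore a .docx) starts with
    // the two bytes 'P' 'K' (0x50 0x4B). Raw header XML starts with '<' instead.
    public static bool LooksLikeZipPackage(byte[] data)
    {
        return data != null
            && data.Length >= 2
            && data[0] == 0x50
            && data[1] == 0x4B;
    }
}
```

Bytes read from HeaderPart.GetStream() would fail this check, which is consistent with the exception shown in the question.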
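Then I decided to use OpenXML's WordprocessingDocument headerDoc = WordprocessingDocument.Create(headerStream, Document)
to extract the header and footer into a new document (which is hard to work with). Unfortunately, converting that document did not succeed either, because this only creates a bare docx document without any settings, styles, web settings, etc. It took a lot of time to figure this out.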
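To illustrate why such a document is bare, here is a minimal sketch of creating one with the Open XML SDK alone. The class and method names (BareDocumentSketch, CreateBareDocument) are made up for this example; the SDK calls are the standard ones. The resulting package contains only a main document part with an empty body, with no style definitions, settings, or web settings parts.

```csharp
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BareDocumentSketch
{
    // Hypothetical helper: builds the smallest possible docx in memory.
    public static byte[] CreateBareDocument()
    {
        using (var stream = new MemoryStream())
        {
            using (var doc = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document))
            {
                // Only the main part is added; nothing else exists in the package.
                MainDocumentPart mainPart = doc.AddMainDocumentPart();
                mainPart.Document = new Document(new Body());
                mainPart.Document.Save();
            }
            // Disposing the package above flushes it into the stream.
            return stream.ToArray();
        }
    }
}
```

Any parts the converter relies on would have to be added by hand, which is presumably what made this route so laborious.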
So in the end I decided to create the new document with Cathal's DocX library, and that finally did it. The code is as follows:
public string ConvertHeaderToHtml(HeaderPart header)
{
    using (MemoryStream headerStream = new MemoryStream())
    {
        //Cathal's DocX Create
        var newDocument = Novacode.DocX.Create(headerStream);
        newDocument.Save();
        using (WordprocessingDocument headerDoc = WordprocessingDocument.Open(headerStream, true))
        {
            var headerParagraphs = new List<OpenXmlElement>(header.Header.Elements());
            var mainPart = headerDoc.MainDocumentPart;
            //Cloning the elements is necessary; appending the originals throws an exception
            // because they are references to elements that still belong to the header
            mainPart.Document.Body.Append(headerParagraphs.Select(h => (OpenXmlElement)h.Clone()).ToList());
            //Copies the header's relationships to the document
            foreach (IdPartPair parts in header.Parts)
            {
                //The second parameter of AddPart is very important: if it is not set, the relationship ID changes
                // and the document's pictures, etc. won't show
                mainPart.AddPart(parts.OpenXmlPart, parts.RelationshipId);
            }
            headerDoc.MainDocumentPart.Document.Save();
            headerDoc.Save();
            headerDoc.Close();
        }
        return ConvertToHtml(headerStream.ToArray());
    }
}
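That is it for the header. I pass in the HeaderPart and take its Header elements. I then extract the relationships, which is very important if the header contains images, and import them into the new document itself, after which the document is ready for conversion.
Use the same steps to generate HTML from the footer.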
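As a sketch of what "the same steps" might look like for a footer, the version below mirrors the header code with FooterPart and its Footer element. The class name FooterSketch and the convertToHtml parameter are assumptions made so the example stands alone; in the answer's own code the call would simply be ConvertToHtml(footerStream.ToArray()).

```csharp
using System;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public static class FooterSketch
{
    // convertToHtml is a stand-in for the ConvertToHtml(byte[]) function above.
    public static string ConvertFooterToHtml(FooterPart footer, Func<byte[], string> convertToHtml)
    {
        using (var footerStream = new MemoryStream())
        {
            // Blank document from Cathal's DocX library, as in the header version.
            var newDocument = Novacode.DocX.Create(footerStream);
            newDocument.Save();

            using (var footerDoc = WordprocessingDocument.Open(footerStream, true))
            {
                var mainPart = footerDoc.MainDocumentPart;

                // Clone the footer's children; the originals still belong to the footer.
                mainPart.Document.Body.Append(
                    footer.Footer.Elements().Select(e => e.CloneNode(true)).ToList());

                // Re-attach images and other related parts under their original IDs.
                foreach (IdPartPair pair in footer.Parts)
                {
                    mainPart.AddPart(pair.OpenXmlPart, pair.RelationshipId);
                }
                mainPart.Document.Save();
            }
            // The package is flushed to the stream when it is disposed.
            return convertToHtml(footerStream.ToArray());
        }
    }
}
```

One caveat this sketch does not handle: if the blank document already uses a relationship ID that also appears in the footer, AddPart may fail, so a real implementation might need to check for collisions first.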
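I hope this helps someone with their work.

Hi, there is no direct way in OpenXML (i.e. in OpenXML PowerTools) to get the header and footer as HTML. Instead, you have to read the header and footer content as text and then apply styles to that text yourself. Please refer to:
I have code that creates a Word document from 3 HTML strings (HtmlBody, HtmlHeader, HtmlFooter). There are also some building blocks in it; if needed, I will try to upload it.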
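A minimal sketch of the "read it as text" approach from this second answer could look like the following. The class and method names (HeaderTextSketch, HeaderToPlainHtml) are invented for illustration; the output is unstyled, so any styling would still have to be applied separately, as the answer says.

```csharp
using System.Net;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class HeaderTextSketch
{
    // Hypothetical helper: emits one HTML-encoded <p> per paragraph in the header.
    public static string HeaderToPlainHtml(HeaderPart header)
    {
        var sb = new StringBuilder();
        foreach (Paragraph paragraph in header.Header.Descendants<Paragraph>())
        {
            // InnerText concatenates the text of all runs in the paragraph.
            sb.Append("<p>")
              .Append(WebUtility.HtmlEncode(paragraph.InnerText))
              .Append("</p>");
        }
        return sb.ToString();
    }
}
```

Images, tables, and formatting are lost with this approach, and paragraphs nested inside other paragraphs (for example in text boxes) would have their text emitted twice.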