How to extract text from MS Office documents in C#

c# ms-office

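I am trying to extract text (strings) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library for reading MS Office documents?

I tried NPOI, but I could not find any examples of how to use it.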

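I wrote a docx text extractor once, and it was very simple. Basically a docx file, and I believe the other (new) formats too, is a zip file containing a bunch of XML files. You can extract the text with an XmlReader, using only .NET classes.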

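I no longer have the code, it seems :( but I did find someone who has similar code.

If you need to read .doc and .xls files this may not be feasible, because those are binary formats and are probably much harder to parse.

There is also the CTP release published by Microsoft.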
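The zip-plus-XmlReader idea described in this answer could look roughly like the sketch below. It is not the answerer's lost code: the class and method names (`DocxXmlReaderSketch`, `ExtractText`) are made up for illustration, and it assumes the body text lives in `word/document.xml` inside `w:t` elements. Headers, footers, footnotes, tabs and line breaks are ignored.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper: reads a .docx with ZipArchive and XmlReader only.
public static class DocxXmlReaderSketch
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of word/document.xml, one line per non-empty or empty paragraph.
    public static string ExtractText(Stream docx)
    {
        var text = new StringBuilder();
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool insideText = false; // true while the reader is inside a <w:t> element
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.NamespaceURI != W) break;
                            if (reader.LocalName == "t" && !reader.IsEmptyElement) insideText = true;
                            else if (reader.LocalName == "p" && reader.IsEmptyElement) text.AppendLine();
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (insideText) text.Append(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            if (reader.NamespaceURI != W) break;
                            if (reader.LocalName == "t") insideText = false;
                            else if (reader.LocalName == "p") text.AppendLine(); // paragraph ends
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Calling it with `File.OpenRead("report.docx")` (a placeholder file name) would return the body text. Because XmlReader streams the part instead of loading it into a DOM, this variant should also cope with large documents.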

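Simple!

These two steps will get you there:

1) Convert the document to DOCX
2) Extract the text from the new DOCX

The link for step 1 explains how to do the conversion and even includes a code sample.

An alternative for step 2 is to unzip the DOCX file in C# and scan it for the files you need. You can read up on the structure of the ZIP file.

Edit: Ah yes, I forgot to mention that, as Skurmedel points out below, you must have Office installed on the system where you do the conversion.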
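The alternative for step 2 (unzip the DOCX and scan it for the files you need) can be sketched with `System.IO.Compression`. The names `DocxPartLister` and `ListWordXmlParts` are hypothetical, and the `word/` prefix filter is an assumption about how Word typically lays out the package (body in `word/document.xml`, with headers, footers and notes in sibling parts).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper: lists the XML parts of a .docx package.
public static class DocxPartLister
{
    // Returns the full names of all *.xml entries stored under "word/".
    public static List<string> ListWordXmlParts(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(e => e.FullName)
                .Where(n => n.StartsWith("word/", StringComparison.OrdinalIgnoreCase)
                         && n.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(n => n, StringComparer.Ordinal)
                .ToList();
        }
    }
}
```

Printing the returned list for a real document is a quick way to see which parts exist before deciding which ones to parse for text.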

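Using PInvokes you can use the IFilter interface (on Windows). IFilters for many common file types are installed with Windows (you can browse them with a tool). You simply ask the IFilter to return the text from the file. There are several sets of sample code.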

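For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This code snippet opens a document and returns its content as text. It is especially useful for anyone trying to parse the content of a Word document with regular expressions. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the OpenXml SDK.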

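Let me slightly correct the answer given by KyleM. I just added handling for two extra nodes that affect the result: `w:tab`, which is emitted as a horizontal tab ("\t"), and `w:br`, which is emitted as a vertical tab ("\v"). Here is the code: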

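    // Requires a reference to DocumentFormat.OpenXml.dll (Open XML SDK).
    using System;
    using System.IO;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging;

    public static string ReadAllTextFromDocx(FileInfo fileInfo)
    {
        StringBuilder stringBuilder;
        // Open the package read-only.
        using(WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))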
        {
            NameTable nameTable = new NameTable();
            XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
            xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            string wordprocessingDocumentText;
            using(StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
            {
                wordprocessingDocumentText = streamReader.ReadToEnd();
            }

            stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);

            XmlDocument xmlDocument = new XmlDocument(nameTable);
            xmlDocument.LoadXml(wordprocessingDocumentText);

            XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
            foreach(XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
                foreach(XmlNode textNode in textNodes)
                {
                    switch(textNode.Name)
                    {
                        case "w:t":
                            stringBuilder.Append(textNode.InnerText);
                            break;

                        case "w:tab":
                            stringBuilder.Append("\t");
                            break;

                        case "w:br":
                            stringBuilder.Append("\v");
                            break;
                    }
                }

                stringBuilder.Append(Environment.NewLine);
            }
        }

        return stringBuilder.ToString();
    }

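Tika is very useful and makes it easy to extract text from many kinds of documents, including Microsoft Office files.

You can use this project, a nice piece of work by Kevin Miller.

Just add this NuGet package.

Then this one line of code does the magic:

    var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx  / pdf  / .... ").Text;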

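A bit late to the party, but these days you don't need to download anything; it all ships with .NET (just make sure you add references to System.IO.Compression and System.IO.Compression.FileSystem). The code is the `DocxTextExtractor` class listed further down this page.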

Use Microsoft Office Interop. It's free and it works smoothly. Here is how I extracted all the words from a document:

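    using Microsoft.Office.Interop.Word;

    // Open the document (requires Word to be installed on this machine)
    string docPath = @"C:\docLocation.doc";
    Application app = new Application();
    Document doc = app.Documents.Open(docPath);

    // Get the full text of the document
    string allWords = doc.Content.Text;

    // Close the document and shut down the Word process
    doc.Close();
    app.Quit();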
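Then do whatever you want with those words.

If you are looking for an ASP.NET option, interop will not work unless Office is installed on the server. Even then, Microsoft says not to do this.

I used Spire.Doc and it worked well. It can even read documents that are really .txt files saved with a .doc extension. There are free and paid versions. You can also get a trial license that removes some of the warnings from documents you create, but I did not create any documents, I only searched them, and for that the free version worked like a charm.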

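A suitable option for extracting text from Office documents in C# is the GroupDocs.Parser for .NET API. The code samples for extracting plain text and formatted text are the two `Parser` snippets at the end of this page.

Extract text

Extract formatted text

Disclosure: I work as a developer evangelist at GroupDocs.

The only sad part about the Office interop libraries is that you need Office installed.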

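Interop is available, but it should be avoided if possible.

Microsoft Word 12.0 Object Library --> this is not in the list when I right-click and choose "Add Reference". Is there another way to add the Microsoft Word 12.0 Object Library so that I can read Word documents?

Interop does not work on GoDaddy hosting. GoDaddy does not support Office.

Interesting… a very hacky solution :)

Not really. This is the mechanism used by the Indexing Service on Windows, and I believe Desktop Search uses it too. I have used it to index PDFs (by installing the Adobe IFilter), all kinds of Office documents (the IFilters for these are installed with Windows) and several other file types. When it works, it works well. Occasionally you get no response from the IFilter, with no explanation why.

I used pInvoke and found it great. To extract text from any document, all we have to do is make sure the appropriate IFilter is installed on the machine (or download and install it). I like this article and sample from Code Project. See also this article for MS Office 2007; here is the MS Office 2007 Filter Pack.

Yes, as long as you have a PDF iFilter installed. You can do this by installing Acrobat Reader (which ships with the iFilter) or by installing the iFilter separately. [Note: other PDF iFilters are available :)]

2 quick Qs. a) I am currently extracting text from PDFs with the method outlined here. What difference would using an iFilter make? b) In the linked IFilter approach, the author does: TextReader reader = new FilterReader(fileName); I am using the FileUpload control in ASP.NET and cannot get the file's path, because for security reasons it is not exposed on the server side. All I can do with the FileUpload control on the server side is: Stream str = fileUpload1.FileContent; byte[] b = fileUpload1.FileBytes;

This is really awesome!

I've already done docx, what about the rest?

You can "connect" to an xlsx file as if it were a database, using ODBC. I think that is a rather cumbersome solution.

I don't have my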

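// .docx text extraction using only the .NET base class library
// (System.IO.Compression plus LINQ to XML).
using System;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        // Map the "w" prefix used in the XPath queries to the WordprocessingML namespace.
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        // A .docx file is a zip archive; the document body is stored in word/document.xml.
        using (var archive = ZipFile.OpenRead(filename))
        using (var stream = archive.GetEntry("word/document.xml").Open())
        {
            return XDocument
                .Load(stream)
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    // Text runs keep their value; tabs and breaks become "\t" and "\v".
                    .Select(e => { switch (e.Name.LocalName) { case "br": return "\v"; case "tab": return "\t"; } return e.Value; })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v))
                    // End each paragraph with a line break so paragraphs do not run together.
                    .AppendLine())
                .ToString();
        }
    }
}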
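Since the question also asks about PowerPoint, the same zip-plus-LINQ-to-XML approach as `DocxTextExtractor` could plausibly be adapted to .pptx files. The sketch below is an illustration, not part of any answer: the names `PptxTextSketch` and `ExtractText` are hypothetical, it assumes slides are stored as `ppt/slides/slideN.xml` with text in DrawingML `a:t` elements, and it visits slides in file-name order, which need not match the presentation order.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: pulls the text runs out of every slide in a .pptx package.
public static class PptxTextSketch
{
    // DrawingML main namespace; slide text runs are stored in <a:t> elements.
    private static readonly XNamespace A = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the slide text, one line per DrawingML paragraph (<a:p>).
    public static string ExtractText(Stream pptx)
    {
        var text = new StringBuilder();
        using (var archive = new ZipArchive(pptx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Sort by name length first so that slide2.xml comes before slide10.xml.
            var slides = archive.Entries
                .Where(e => e.FullName.StartsWith("ppt/slides/slide", StringComparison.OrdinalIgnoreCase)
                         && e.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(e => e.FullName.Length)
                .ThenBy(e => e.FullName, StringComparer.Ordinal)
                .ToList();

            foreach (ZipArchiveEntry slide in slides)
            {
                using (Stream s = slide.Open())
                {
                    XDocument xml = XDocument.Load(s);
                    foreach (XElement paragraph in xml.Descendants(A + "p"))
                    {
                        // Join all text runs of the paragraph into one line.
                        text.AppendLine(string.Concat(paragraph.Descendants(A + "t").Select(t => t.Value)));
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Speaker notes and slide masters live in other parts of the package and are deliberately left out here.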

// Requires a reference to the GroupDocs.Parser for .NET library.
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

// Extract text
// Create an instance of the Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract the text into a reader
    using (TextReader reader = parser.GetText())
    {
        // Print the text from the document
        // If text extraction isn't supported, the reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

// Extract formatted text
// Create an instance of the Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract the formatted text (as HTML) into a reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print the formatted text from the document
        // If formatted text extraction isn't supported, the reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}