C#: Extracting structured data from a PDF without OCR

c# automation

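I have been trying to extract data, including tables, from PDF files using C#. My goal is to extract this data without OCR and without any third-party library (and its license), and to do so without losing the structure of the data. I need this to build a DLL for PDF automation.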

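using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using Office = Microsoft.Office.Core;
using Word = Microsoft.Office.Interop.Word;

namespace PDF_EXTRACT
{
    public class pdfTohtm
    {
        // Opens the PDF in Word, saves it as filtered HTML, and returns the markup.
        // path:    full path of the source PDF.
        // outpath: full output path WITHOUT an extension; the result is read from outpath + ".htm".
        public static string ConvertPdf(string path, string outpath)
        {
            Word.Application app = null;
            Word.Document doc1 = null;
            try
            {
                app = new Word.Application();
                // Configure Word before opening the file so the settings apply to the conversion.
                app.Visible = false;
                // Word runs hidden, so suppress prompts that nobody could answer.
                app.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone;
                app.FileValidation = Office.MsoFileValidationMode.msoFileValidationSkip;
                app.AutomationSecurity = Office.MsoAutomationSecurity.msoAutomationSecurityForceDisable;

                doc1 = app.Documents.Open(path, false, ReadOnly: false);
                // Request UTF-8 explicitly so the file can be read back with a known encoding.
                doc1.SaveAs2(outpath, Word.WdSaveFormat.wdFormatFilteredHTML,
                    ReadOnlyRecommended: false, Encoding: Office.MsoEncoding.msoEncodingUTF8);

                // Close the document before reading, otherwise Word still holds the file.
                doc1.Close(SaveChanges: false);
                Marshal.FinalReleaseComObject(doc1);
                doc1 = null;

                string result = File.ReadAllText(outpath + ".htm", Encoding.UTF8);
                return "success:" + result;
            }
            catch (Exception e)
            {
                return "failed::::" + e;
            }
            finally
            {
                // Clean up even when the conversion fails, so no hidden WINWORD.EXE is left running.
                if (doc1 != null)
                {
                    doc1.Close(SaveChanges: false);
                    Marshal.FinalReleaseComObject(doc1);
                }
                if (app != null)
                {
                    app.Quit();
                    Marshal.FinalReleaseComObject(app);
                }
            }
        }
    }
}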

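Explanation: this solution opens the PDF as an editable Word document and then saves it as a .htm file. The .htm file is then opened and read as a text file, so the output of this code is a block of HTML markup. You can paste that markup into Excel to get the PDF content in Excel format without losing the structure of the data.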
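
If the goal is to consume the tables in code rather than paste them into Excel, the returned HTML can be split into rows and cells. The following is only a minimal sketch using the base class library: `HtmlTableReader` and `ReadRows` are hypothetical names, and it assumes simple, non-nested `<tr>`/`<td>` markup, which the HTML Word produces for a given PDF may or may not satisfy. Strip the "success:" prefix from the `ConvertPdf` result before passing it in.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HtmlTableReader
{
    // Returns every table row found in the HTML as a list of cell texts.
    // Assumption: tables are not nested; a nested table would confuse the regex.
    public static List<List<string>> ReadRows(string html)
    {
        var rows = new List<List<string>>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        foreach (Match row in Regex.Matches(html, @"<tr\b[^>]*>(.*?)</tr>", options))
        {
            var cells = new List<string>();
            foreach (Match cell in Regex.Matches(row.Groups[1].Value,
                         @"<t[dh]\b[^>]*>(.*?)</t[dh]>", options))
            {
                // Drop inner tags (p, span, ...), decode entities, collapse whitespace.
                string text = Regex.Replace(cell.Groups[1].Value, "<[^>]+>", " ");
                text = WebUtility.HtmlDecode(text);
                text = Regex.Replace(text, @"\s+", " ").Trim();
                cells.Add(text);
            }
            rows.Add(cells);
        }
        return rows;
    }
}
```

Each inner list is one table row, so the row and column structure of the PDF table is kept as long as Word recognized it as a table during conversion.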

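Important notes:

1. This solution does not work if the PDF is a scanned copy, as far as I have been able to find out on this topic.
2. For the "path" parameter, the full path of the file must be passed. For the "outpath" parameter, pass the full path without an extension, e.g. C:\Users\username\folder\filename (no extension); the code reads the result back from that path plus ".htm".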
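
The path rules in note 2 and the "success:" / "failed::::" prefixes returned by `ConvertPdf` can be wrapped in a small helper. This is a sketch under stated assumptions: `ConvertPdfPaths`, `OutPath`, and `TryUnwrap` are hypothetical names introduced here for illustration, and the output is assumed to go in the same folder as the PDF.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original answer.
public static class ConvertPdfPaths
{
    // Builds the "outpath" argument: the PDF's full path with the extension removed.
    public static string OutPath(string pdfFile)
    {
        string full = Path.GetFullPath(pdfFile);
        return Path.Combine(Path.GetDirectoryName(full),
                            Path.GetFileNameWithoutExtension(full));
    }

    // Splits the string returned by ConvertPdf into a success flag and a payload.
    // On success the payload is the HTML; on failure it is the exception text.
    public static bool TryUnwrap(string result, out string payload)
    {
        const string ok = "success:";
        const string failed = "failed::::";

        if (result.StartsWith(ok, StringComparison.Ordinal))
        {
            payload = result.Substring(ok.Length);
            return true;
        }

        payload = result.StartsWith(failed, StringComparison.Ordinal)
            ? result.Substring(failed.Length)
            : result;
        return false;
    }
}
```

A call would then look like `pdfTohtm.ConvertPdf(Path.GetFullPath(pdf), ConvertPdfPaths.OutPath(pdf))`, followed by `TryUnwrap` on the returned string.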

I think the best way to achieve this is to use a library called iTextSharp. It is readily available as a NuGet package.

Here is an example:

using System;
using System.IO;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace Pdf2Text
{
    class Program
    {
        // Usage: Pdf2Text <file.pdf>; writes <file>.txt next to the input.
        static void Main(string[] args)
        {
            if (!args.Any()) return;

            var file = args[0];
            var output = Path.ChangeExtension(file, ".txt");
            if (!File.Exists(file)) return;

            var bytes = File.ReadAllBytes(file);
            File.WriteAllText(output, ConvertToText(bytes), Encoding.UTF8);
        }

        // Extracts the text of every page; returns whatever was read so far if parsing fails.
        private static string ConvertToText(byte[] bytes)
        {
            var sb = new StringBuilder();

            try
            {
                // PdfReader is IDisposable in iTextSharp 5.x, so release it when done.
                using (var reader = new PdfReader(bytes))
                {
                    var numberOfPages = reader.NumberOfPages;

                    // iTextSharp page numbers start at 1.
                    for (var currentPageIndex = 1; currentPageIndex <= numberOfPages; currentPageIndex++)
                    {
                        // AppendLine keeps the last line of one page apart from the first line of the next.
                        sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, currentPageIndex));
                    }
                }
            }
            catch (Exception exception)
            {
                Console.WriteLine(exception.Message);
            }

            return sb.ToString();
        }
    }
}

Is iTextSharp free for commercial use?

Yes, I have used it for multiple clients. It is available for free as a NuGet package.