Extracting text from a multi-page PDF attachment using Google Apps Script

pdf google-apps-script text

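I have a Gmail PDF attachment with multiple scanned pages. When I use Google Apps Script to save the attachment's blob as a Drive file, then manually open the PDF from Google Drive and choose "Open with Google Docs", all of the text in the PDF appears in the resulting Google Doc. However, when I save the blob as a Google Doc using OCR, only the text from the image on the first page ends up in the Doc, whether I access it manually or through code.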

The code that gets the blob and processes it is as follows:

function getAttachments(desiredLabel, processedLabel, emailQuery){
    // Find emails
    var threads = GmailApp.search(emailQuery);
    if(threads.length > 0){
        // Iterate through the emails
        for(var i in threads){
            var mesgs = threads[i].getMessages();
            for(var j in mesgs){
                var processingMesg = mesgs[j];
                var attachments = processingMesg.getAttachments();
                var processedAttachments = 0;
                // Iterate through attachments
                for(var k in attachments){
                    var attachment = attachments[k];
                    var attachmentName = attachment.getName();
                    var attachmentType = attachment.getContentType();
                    // Process PDFs
                    if (attachmentType.includes('pdf')) {
                        processedAttachments += 1;
                        var pdfBlob = attachment.copyBlob();
                        var filename = attachmentName + " " + processedAttachments;
                        processPDF(pdfBlob, filename);
                    }
                }
            }
        }
    }
}


function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}

If I try to read the PDF directly with Google Docs, without OCR, I get "Exception: Invalid argument", for example:

DocumentApp.openById(pdfFile.getId());

How do I get the text from all pages of the PDF?

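`DocumentApp.openById` is a method that can only be used on Google Docs documents.

`pdfFile` is a PDF, so it can only be "opened" through DriveApp:

DriveApp.getFileById(pdfFile.getId());

Opening a file with DriveApp lets you use the File methods on it.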
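One way to avoid the invalid-argument exception is to check the file's MIME type through DriveApp before passing the ID to DocumentApp. The following is only a minimal sketch of that idea. `readTextIfGoogleDoc` is a hypothetical helper name introduced here for illustration and is not part of any Google API.

```javascript
// Sketch: only hand a file ID to DocumentApp when the file is a Google Doc.
// `readTextIfGoogleDoc` is a hypothetical helper, not a built-in method.
function readTextIfGoogleDoc(fileId) {
  // DriveApp can open any Drive file, including PDFs.
  var file = DriveApp.getFileById(fileId);
  var mimeType = file.getMimeType();

  // MIME type of native Google Docs documents.
  if (mimeType !== 'application/vnd.google-apps.document') {
    // Not a Google Doc (for example 'application/pdf'):
    // DocumentApp.openById would throw, so return null instead.
    return null;
  }

  return DocumentApp.openById(fileId).getBody().getText();
}
```

For the uploaded `pdfFile`, this would return `null` rather than throwing, which signals that the file still needs to be converted (for example with OCR) before its text can be read.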

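As for the OCR conversion, your code correctly converts all pages of a PDF document to a Google Doc, so the source of your error is probably the attachment itself or the way the blob is retrieved.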

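Note that OCR conversion is not good at preserving formatting, so a two-page PDF may be collapsed into a one-page Doc, depending on how the PDF is formatted.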
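Because page boundaries may not survive the conversion, it can be easier to judge the result by the extracted text than by the page count. The sketch below assumes the Drive advanced service (v2) is enabled, matching the `Drive.Files.insert` call in the question. Other versions of the service use different method names. `ocrPdfToText` is a hypothetical helper name.

```javascript
// Sketch: OCR a PDF blob into a Google Doc and return the extracted text
// together with the ID of the converted Doc.
// Assumes the Drive advanced service (v2); `ocrPdfToText` is hypothetical.
function ocrPdfToText(pdfBlob, title) {
  var resource = {
    title: title,
    mimeType: pdfBlob.getContentType()
  };
  var options = {
    ocr: true,
    ocrLanguage: 'en'
  };

  // Upload the blob and let Drive convert it to a Google Doc via OCR.
  var converted = Drive.Files.insert(resource, pdfBlob, options);

  // Read the full body text of the converted Doc.
  var text = DocumentApp.openById(converted.id).getBody().getText();

  // Returning the ID makes it possible to confirm which Doc was inspected.
  return { id: converted.id, text: text, length: text.length };
}
```

Logging the returned `id` and `length` for each attachment is one way to confirm that the Doc being checked is the one produced from the PDF in question.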

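Can you provide the code that gets pdfBlob?

I added the code that gets the attachment. I only need the text; formatting is irrelevant. The blob has all the data. I managed to get a binary structure that shows the pages, but it does not seem to be accessible. If there were some way to access the PDF page by page, that might work.

This was not a coding problem. I eventually found that, when checking the output, I was looking at the wrong reference document. When I tried another multi-page PDF, it read all the text. NIC error: Not In Computer.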