How do I read the content of Google Drive documents with content types such as application/msword and application/pdf?

google-api google-drive-api

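I can get the content of files whose content type is text/plain, but not of files whose content type is application/msword or application/pdf.

Is there a way to fetch that content and read it correctly? Here is the code that works fine for text/plain: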

HttpResponse resp = service.getRequestFactory()
                  .buildGetRequest(new GenericUrl(file.getDownloadUrl())).execute();

BufferedReader output = new BufferedReader(new InputStreamReader(resp.getContent()));
System.out.println("Shorten Response: ");
for (String line = output.readLine(); line != null; line = output.readLine()) {
    System.out.println(line);
}

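I believe both the PDF and MS Word formats are binary streams, so they cannot be read line by line. Try reading them into a byte[] buffer instead: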

com.google.api.services.drive.Drive svc;
InputStream is = svc.getRequestFactory()
.buildGetRequest(new GenericUrl("xxx")).execute().getContent();

// Reads the whole stream into memory and closes it.
// Returns an empty array when the stream is null; I/O errors are
// propagated to the caller instead of being silently swallowed.
public byte[] strm2Bytes(InputStream is) throws IOException {
    ByteArrayOutputStream byteBuffer = new ByteArrayOutputStream();
    if (is == null) {
        return byteBuffer.toByteArray();
    }
    byte[] buffer = new byte[2048];
    try (BufferedInputStream bufIS = new BufferedInputStream(is)) {
        int cnt;
        while ((cnt = bufIS.read(buffer)) >= 0) {
            byteBuffer.write(buffer, 0, cnt);
        }
    }
    return byteBuffer.toByteArray();
}

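But then you end up with the raw file bytes, and I really don't know what you want to do with them. Convert them? Display them? Usually such byte buffers are handed to a "decoder" (a Word reader, a PDF reader, a JPEG decoder, and so on). Then again, those readers/decoders usually accept an InputStream directly, so there is no need to buffer the bytes for them.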
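To decide which decoder the raw bytes should go to, it can help to look at the first few bytes of the download. The following is only a sketch and not part of the original answer: the class and method names (`MagicSniffer`, `sniff`) are hypothetical, and the `InputStream` overload assumes Java 9 or later for `readAllBytes()`. It checks three well-known signatures: `%PDF` for PDF files, the OLE2 compound-file header used by legacy .doc files, and the ZIP header used by .docx files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Drive API: guesses the container
// format of a download from its leading "magic" bytes.
class MagicSniffer {
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // True when data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns "PDF", "OLE2" (legacy .doc and similar), "ZIP" (.docx and similar)
    // or "UNKNOWN". This identifies the container only, not the exact document type.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF)) return "PDF";
        if (startsWith(data, OLE2)) return "OLE2";
        if (startsWith(data, ZIP)) return "ZIP";
        return "UNKNOWN";
    }

    // Convenience overload: reads the whole stream first (Java 9+).
    static String sniff(InputStream in) throws IOException {
        return sniff(in.readAllBytes());
    }

    public static void main(String[] args) throws IOException {
        byte[] fakePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(fakePdf)));
        System.out.println(sniff("plain text".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A signature check like this only tells you which family of decoder to try; the content type reported by Drive is still the primary hint.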

I used the Tika parser, and it works in my case. Please check this snippet:

            HttpResponse resp = service.getRequestFactory().
            buildGetRequest(new GenericUrl(file.getDownloadUrl())).execute();

            Detector detector = new DefaultDetector();
            Parser parser = new AutoDetectParser(detector);
            Metadata metadata = new Metadata();
            InputStream input = TikaInputStream.get(resp.getContent());
            ContentHandler handler2 = new BodyContentHandler(
                    Integer.MAX_VALUE);
            parser.parse(input, handler2, metadata, new ParseContext());
            String text = handler2.toString();

I used tika-app-1.3.jar. It handles .pdf, .doc, .docx, .text, and similar files.

Thanks, everyone, for the replies.

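I want to index these files by their content. That doesn't work; I only want to get the text data out of the documents, because I need to add it to Solr. For .text files I get all the text, but for .pdf and .doc files I get some stream data. I want to convert it into text or a readable string and display it. So please tell me: how can I show this input stream in a readable form?

I have answered your question of "how to read". How to display or render the MIME types above is a different question, and one I'm not familiar with. I work with JPEG images; in my case I hand the buffer to a JPEG decoder and display the bitmap. As I mentioned in my answer above, you have to find a "decoder/presenter" for the raw data you downloaded.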
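The JPEG case described in that last comment can be sketched with the JDK's own `javax.imageio.ImageIO`, which takes an InputStream directly. This is an illustration only, not the commenter's actual code: the class name `StreamDecodeSketch` and its methods are hypothetical, and the "download" is simulated by encoding a small blank image in memory. For PDF or Word files the equivalent "decoder" would be a text extractor such as the Tika parser from the other answer.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Hypothetical illustration: a decoder that consumes an InputStream directly.
class StreamDecodeSketch {

    // Stand-in for a downloaded file: encodes a blank image as JPEG bytes in memory.
    static byte[] sampleJpeg(int width, int height) throws IOException {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(img, "jpg", out)) {
            throw new IOException("no JPEG writer available");
        }
        return out.toByteArray();
    }

    // Hands the stream straight to the decoder; no intermediate byte[] is built here.
    static BufferedImage decode(InputStream in) throws IOException {
        ImageIO.setUseCache(false);
        BufferedImage img = ImageIO.read(in);
        if (img == null) {
            throw new IOException("no registered decoder recognised the stream");
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        // With the Drive client, this stream would come from the HTTP response content.
        try (InputStream in = new ByteArrayInputStream(sampleJpeg(8, 4))) {
            BufferedImage img = decode(in);
            System.out.println(img.getWidth() + "x" + img.getHeight());
        }
    }
}
```

The point of the sketch is the shape of `decode`: the stream goes to the format-specific decoder, and what comes back (a bitmap here, extracted text for documents) is what you display or index.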