How can I make this Python method return a string instead of writing it to stdout?

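I'm trying to extract text from PDFs using Python. For this I found pdfminer, which does a reasonably good job when run through its pdf2txt.py command-line tool, like this: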

kramer65 $ pdf2txt.py myfile.pdf
all the text contents
of the pdf
are printed out here..

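Because I want to use this functionality inside my own program, I want to use it as a module rather than as a command-line tool. So I managed to adapt the pdf2txt.py file to the following: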

#!/usr/bin/env python
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams

def main(fp):
    debug = 0
    pagenos = set()
    maxpages = 0
    imagewriter = None
    codec = 'utf-8'
    caching = True
    laparams = LAParams()

    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug

    resourceManager = PDFResourceManager(caching=caching)
    outfp = sys.stdout
    device = TextConverter(resourceManager, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  # Here I want to return the extracted text string

I can now call it as a module, like this:

>>> from my_pdf2txt import main
>>> main(open('myfile.pdf', 'rb'))
all the text contents
of the pdf
are printed out here..

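It currently prints the resulting strings using sys.stdout.write(), but what I actually want is for it to return those strings through the return statement on the last line of the code. Because the use of sys.stdout.write is buried quite deep, I don't really know how to make this method return the strings instead of writing them to stdout.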

Does anybody know how I can make this method return the found strings instead of writing them to stdout? All tips are welcome!

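As Darth Kotik suggested, you can point sys.stdout at any file-like object you want. Then, when you call the function, the printed data is directed to your object instead of the screen. For example: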

import sys
import io  # Python 3: StringIO lives in io (Python 2 used the StringIO module)

def frob():
    sys.stdout.write("Hello, how are you doing?")


# We want to call frob, storing its output in a temporary buffer.

# Hold on to the old reference to stdout so we can restore it later.
old_stdout = sys.stdout

# Create a temporary buffer object, and assign it to stdout.
output_buffer = io.StringIO()
sys.stdout = output_buffer

frob()

# Retrieve the result.
result = output_buffer.getvalue()

# Restore the old value of stdout.
sys.stdout = old_stdout

print("This is the result of frob: ", result)

Output:

This is the result of frob:  Hello, how are you doing?

For your problem, you just need to replace the frob() call with main(fp).
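On Python 3.4 and later, the same save, replace and restore sequence can also be written with contextlib.redirect_stdout, which puts the previous sys.stdout back even if the wrapped call raises. The following is only a minimal sketch: capture_stdout is a hypothetical helper name, and frob again stands in for main(fp). One caveat if main() is used exactly as posted in the question: it ends with outfp.close(), and outfp is sys.stdout, so the buffer would be closed before getvalue() could be called. That line would have to be removed first.

```python
import contextlib
import io
import sys


def capture_stdout(func, *args, **kwargs):
    """Call func and return everything it wrote to sys.stdout as a string.

    Hypothetical helper for illustration; it is not part of pdfminer.
    """
    buffer = io.StringIO()
    # redirect_stdout swaps sys.stdout for the buffer and restores the
    # previous value on exit, even if func raises.
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def frob():
    # Stand-in for main(fp): looks up sys.stdout at call time and writes to it.
    sys.stdout.write("Hello, how are you doing?")


result = capture_stdout(frob)
```

With this helper the call site would become something like text = capture_stdout(main, fp), assuming main() no longer closes the stream it writes to.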

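The question asks how to return the output as a string. In case anyone here wants to know how to write the output directly to a file instead of printing it on the terminal, here is a one-line solution that worked for me.

Just add the line:

sys.stdout=open("pdf_text.txt","w")

before the line:

outfp = sys.stdout

Hope this helps someone.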
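Reassigning sys.stdout to an open file this way leaves the file open and stdout redirected for the rest of the program. A possible alternative, sketched here on the assumption that the function can be changed to take its output stream as a parameter, is to pass the stream in explicitly. write_text is a hypothetical stand-in for the body of main(); in the question's code the corresponding change would be turning outfp = sys.stdout into an outfp argument.

```python
import io
import os
import tempfile


def write_text(outfp):
    # Hypothetical stand-in for main(): writes to whatever stream it is
    # given instead of looking up sys.stdout.
    outfp.write("all the text contents\n")
    outfp.write("of the pdf\n")


def text_as_string():
    # Collect the output in memory and return it as a string.
    buffer = io.StringIO()
    write_text(buffer)
    return buffer.getvalue()


def text_to_file(path):
    # The with-block closes the file once writing is done.
    with open(path, "w") as outfp:
        write_text(outfp)


# Usage: the same writer, two destinations.
as_string = text_as_string()
path = os.path.join(tempfile.mkdtemp(), "pdf_text.txt")
text_to_file(path)
```

If the converter turns out to write encoded bytes rather than str (this may depend on the pdfminer version and the codec argument, so treat it as something to verify), io.BytesIO and a file opened in "wb" mode would be the matching choices.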

You can use a file or a StringIO as stdout, so you can capture the result and return it. You can use sys.stdout = sys.__stdout__ instead of sys.stdout = old_stdout. I think it looks nicer.
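One difference worth knowing: sys.__stdout__ always refers to the stream the interpreter started with, whereas a saved old_stdout refers to whatever was installed when the capture began. The two only diverge when stdout had already been redirected by something else, such as a test runner or an outer capture. The sketch below, using hypothetical names, illustrates that case.

```python
import io
import sys


def restore_comparison():
    """Return (saved_ok, dunder_ok) for a capture nested inside a redirect."""
    original = sys.stdout
    outer = io.StringIO()
    try:
        sys.stdout = outer               # something else already redirected stdout
        old_stdout = sys.stdout          # what our capture code would save
        sys.stdout = io.StringIO()       # our own temporary buffer

        sys.stdout = old_stdout          # restoring the saved reference...
        saved_ok = sys.stdout is outer   # ...returns to the outer redirect

        sys.stdout = io.StringIO()
        sys.stdout = sys.__stdout__      # restoring the interpreter's original...
        dunder_ok = sys.stdout is outer  # ...skips past the outer redirect
    finally:
        sys.stdout = original            # leave the process as we found it
    return saved_ok, dunder_ok


saved_ok, dunder_ok = restore_comparison()
```

In a plain script with no other redirection in play, both forms restore the same stream, so the choice is mostly a matter of taste.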