Using pypdf on a list of PDFs in Python

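I have pypdf working fine on a single PDF file, but I can't get it to work on a small list of files, or in a for loop over multiple PDFs: it fails because a string is not callable. Any ideas for a workaround?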

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

#print getPDFContent(r"Z:\GIS\MasterPermits\12300983.pdf").encode("ascii", "ignore")


#find pdfs            
for root, dirs, files in os.walk(folder1):
    for file in files:
      if file.endswith(('.pdf')):
          d=os.path.join(root, file)
          print getPDFContent(d).encode("ascii", "ignore")

Traceback (most recent call last):
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 50, in <module>
    print getPDFContent(d).encode("ascii", "ignore")
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 32, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
TypeError: 'str' object is not callable

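I tried using a list, but I get exactly the same error. I didn't think this would be a big problem, but it is turning into one. I know I was able to work around a similar problem in arcpy, but this isn't even close.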

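Try not to use built-in names as variable names. In Python 2, file is a built-in type, and the loop below rebinds that name to a filename string, so the later call file(path, "rb") ends up trying to call a str.

Don't do this:

for file in files:

Do this instead, and use myfile in the rest of the loop body:

for myfile in files: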
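
As a rough illustration of the rename, here is a minimal Python 3 sketch of the directory walk. The helper names find_pdfs and read_header are hypothetical and not part of the original script, and the built-in open stands in for Python 2's file, which does not exist in Python 3. The case-insensitive extension check is also an assumption; the original matched only lowercase ".pdf".

```python
import os


def find_pdfs(folder):
    """Yield the full path of every PDF file found under folder."""
    # "filename" (not "file") is the loop variable, so no built-in is shadowed.
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            if filename.lower().endswith(".pdf"):
                yield os.path.join(root, filename)


def read_header(path, size=5):
    """Return the first few bytes of a file, opened in binary mode."""
    # open() is still the built-in here because nothing has rebound the name.
    with open(path, "rb") as handle:
        return handle.read(size)
```

In the original script, the loop body would then presumably call getPDFContent on each path yielded by find_pdfs(folder1).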

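It would help if you could provide a complete program. Please reduce your program to the shortest complete, runnable program that demonstrates the problem, and paste it into your question.

In the call file(path, "rb"), I suspect file doesn't mean what you think it means. Try adding print type(file), file immediately before the failing call. Are you using the variable name file anywhere else in your program?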
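
The type check suggested above can be reproduced in isolation. The sketch below is a Python 3 stand-in, not the asker's code: it deliberately shadows the built-in open (since file exists only in Python 2) with a loop variable, then inspects the name and calls it. The function name shadow_demo and the sample filenames are made up for illustration.

```python
def shadow_demo():
    """Show what happens when a loop variable reuses a built-in's name."""
    names = ["a.pdf", "b.txt"]
    # Deliberately bad: this rebinds "open" to each string in turn.
    for open in names:
        pass
    # Same idea as "print type(file), file": check what the name refers to now.
    kind = type(open).__name__
    try:
        open("a.pdf", "rb")  # calls the string "b.txt", not the built-in
    except TypeError as exc:
        return kind, str(exc)


print(shadow_demo())
```

The reported type is str rather than a built-in function, which is the same situation the traceback in the question describes.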