Python 3.x: Converting multiple PDFs in a folder to txt with Python


I tried the following code, but it only converts the last PDF in the folder:

import fitz  # this is pymupdf
import glob, os
os.chdir('C:/Users/XXXXXXX')
pdfs = []
for file in glob.glob("*.pdf"):
 with fitz.open(file) as doc:
    text = ""
    for page in doc:
        text += page.getText()
textfile = open('textfile.txt', 'w',encoding="utf-8")
textfile.write(text)
Can you help me?

I am using Python 3.8.

If the problem is that your loop isn't working (which is most likely the case), you can use os.walk("start_dir"). For example:

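import os
import fitz  # PyMuPDF

for path, dirs, files in os.walk('.'):  # Walk the tree from the current directory.
    for file in files:  # Loop through each file.
        if file.lower().endswith('.pdf'):  # Skip anything that is not a PDF.
            with fitz.open(os.path.join(path, file)) as doc:  # os.walk yields bare names, so join with path.
                ...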

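It helps to tell getText explicitly what to extract (for example "text" for plain text, which is also the default). Then append that text to a list created outside the page loop so it isn't overwritten. Finally, convert that list to a string.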

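Edit: I've revised my original answer to do what you asked. To write each PDF to its own .txt file, you need to write the file inside the loop. Don't forget to close textfile before moving on to the next PDF, otherwise the following files won't be written.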

import fitz  # PyMuPDF
import glob, os

DIR = '\\pdftext\\'
os.chdir(DIR + 'pdf\\')

def listToString(s):
    # Join the per-page strings into a single string.
    return "".join(s)

for file in glob.glob("*.pdf"):
    print(file)
    filename = os.path.splitext(file)[0]  # PDF name without the extension.
    pdfs = []  # Page texts for this PDF only.

    with fitz.open(file) as doc:
        for page in doc:
            # "text" requests plain text; recent PyMuPDF releases name this method get_text.
            pdfs.append(page.getText("text"))

    # Write one .txt per PDF inside the loop, and close it before the next PDF.
    textfile = open(DIR + 'text\\' + filename + '.txt', 'w', encoding="utf-8")
    textfile.write(listToString(pdfs))
    textfile.close()
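For comparison, here is a minimal sketch of the same "one .txt per PDF" idea using pathlib. The names convert_folder and extract_with_pymupdf are hypothetical helpers, not part of PyMuPDF, and the sketch assumes a recent PyMuPDF release where the page method is called get_text (older releases use getText). The extraction step is passed in as a function, so the folder logic can be tried without any real PDFs.

```python
from pathlib import Path
from typing import Callable, List


def extract_with_pymupdf(pdf_path: Path) -> str:
    """Return the plain text of all pages, concatenated (needs PyMuPDF installed)."""
    import fitz  # imported lazily so the folder logic below runs without PyMuPDF

    with fitz.open(str(pdf_path)) as doc:
        # Assumption: recent PyMuPDF, where the method is get_text (formerly getText).
        return "".join(page.get_text() for page in doc)


def convert_folder(src_dir, out_dir,
                   extract: Callable[[Path], str] = extract_with_pymupdf) -> List[Path]:
    """Write one UTF-8 .txt file per PDF in src_dir, keeping each PDF's base name."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src.glob("*.pdf")):
        target = out / (pdf_path.stem + ".txt")  # a.pdf -> a.txt
        # write_text opens, writes and closes the file for this PDF in one step.
        target.write_text(extract(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

Called as convert_folder("pdf", "text"), this would leave text/a.txt next to every pdf/a.pdf. Because each output file is opened and closed inside the loop, nothing carries over from one PDF to the next.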
I tried:

import glob
import fitz  # PyMuPDF

for fname in glob.glob("C:/Users/XXXXXX/*.pdf"):
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output (binary mode)
    for page in doc:  # iterate the document pages
        text = page.getText().encode("utf8")  # get plain text, encoded as UTF-8
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()

It works, but I still need to test the results :-)
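One way to check the results is to read an output file back and split it on the form feed (0x0C) that the code above writes after every page. This is only a sketch; split_pages and read_pages are hypothetical helper names, and it assumes the page text itself contains no form feed characters.

```python
def split_pages(text: str) -> list:
    """Split text in which every page is followed by a form feed into a list of pages."""
    pages = text.split("\f")
    # Each page ends with a delimiter, so the last piece is an empty leftover.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def read_pages(txt_path) -> list:
    """Read a .txt file written in binary UTF-8 mode and return its pages."""
    with open(txt_path, "rb") as f:  # binary mode avoids newline translation
        return split_pages(f.read().decode("utf8"))
```

Comparing len(read_pages(path)) with the page count of the source PDF gives a quick sanity check that no page was dropped.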

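Thanks for your reply!!! It walks through the folder and converts the PDFs, but merges the content into a single txt (textfile.txt). I need as many txt files as there are PDFs (keeping the same names if possible).

Yes, you have to save the text after each document, so if you move your file-writing code into the for loop, it should work.

You re-initialize text = "" on every iteration...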
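The effect described in these comments can be reproduced without any PDFs. In the sketch below a plain dict stands in for the folder (file name mapped to a list of page texts); the data and both function names are made up for illustration. The first function resets text for every file and only "writes" after the loop, so only the last file survives; the second stores one output per file inside the loop.

```python
# Hypothetical stand-in for a folder: file name -> list of page texts.
docs = {"a.pdf": ["A1", "A2"], "b.pdf": ["B1"]}


def convert_buggy(docs):
    """Mimics the question's code: text is reset per file, output happens once at the end."""
    text = ""
    for name, pages in docs.items():
        text = ""  # wipes the previous file's text
        for page in pages:
            text += page
    return {"textfile.txt": text}  # only the last file's text is left


def convert_fixed(docs):
    """Stores one output per input file, inside the loop."""
    outputs = {}
    for name, pages in docs.items():
        outputs[name[:-len(".pdf")] + ".txt"] = "".join(pages)
    return outputs


print(convert_buggy(docs))  # {'textfile.txt': 'B1'}
print(convert_fixed(docs))  # {'a.txt': 'A1A2', 'b.txt': 'B1'}
```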