Python Pdfminer: Unicode text not extracted correctly
I am trying to extract Sinhala text from a Unicode-encoded PDF file, but the text is not extracted correctly.
The text in the PDF is: කවුරුහරි ඔබෙන් ඇහුවොත්
The extracted text is: කවුරුහරි ඔබෙන් ඇහුබ ොත්
One letter is replaced by another. This is the code I am using:
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path, password='', maxpages=0):
    # password and maxpages were undefined in the original snippet; the
    # defaults here are assumed ('' = no password, 0 = no page limit).
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    caching = True
    pagenos = set()  # an empty set means every page is processed
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


u = convert_pdf_to_txt('sinhala.pdf')
print(u)
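
To pin down exactly which code points differ between the expected and the extracted string, a small comparison like the one below may help. It is only a diagnostic sketch and does not fix the extraction: it uses the standard-library difflib and unicodedata modules, the two strings are copied from the question above, and the helper names describe and diff_code_points are made up for this illustration.

```python
import difflib
import unicodedata

# Strings copied from the question: what the PDF shows vs. what was extracted.
expected = "කවුරුහරි ඔබෙන් ඇහුවොත්"
extracted = "කවුරුහරි ඔබෙන් ඇහුබ ොත්"


def describe(text):
    """Return one 'U+XXXX NAME' label per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def diff_code_points(a, b):
    """Return the non-matching spans between a and b as (tag, a_part, b_part)."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Print every span where the extracted text departs from the expected text.
for tag, old, new in diff_code_points(expected, extracted):
    print(tag, describe(old), "->", describe(new))
```

If the output shows a single consonant swapped for another and an extra space in front of the vowel sign, that would suggest the problem lies in how the font's glyphs are mapped to Unicode rather than in the utf-8 codec setting, though this is a guess and not something the snippet can confirm.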