Python 3.x: Scraping PDFs with Python, then encoding them as UTF-8

python-3.x pdf unicode utf-8

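I am using Python to scrape some PDFs from the web and convert them to text files for analysis in R.

I am using pdfminer and then encoding the results as UTF-8, but the finished text files still contain many representations of bytes objects (for example "\xe2\x80\x94") instead of the intended characters themselves.

My question is similar to this one; the difference is that I have already encoded my bytes objects as UTF-8 and still have the same problem.

My code is as follows: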

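# Assumes the pdfminer3k module layout, where process_pdf lives in pdfminer.pdfinterp
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open
from urllib.request import urlopen

def readPDF(pdfFile):
    rsrcmgr=PDFResourceManager()
    retstr=StringIO()   # collects the extracted text as a str
    laparams=LAParams()
    device=TextConverter(rsrcmgr,retstr,laparams=laparams)
    process_pdf(rsrcmgr,device,pdfFile)
    device.close()
    content=retstr.getvalue()
    retstr.close()
    return content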

pdfFile=urlopen(webaddress)
outputString=readPDF(pdfFile)
proceedings=outputString.encode('utf-8')
proceedings=str(proceedings)
file=open(filename,"w")
file.write(proceedings)
file.close()

Apologies if this is simple. I am very new to Python.

The following block of code is unnecessary and may encode and decode the data incorrectly. See the inline comments.

proceedings=outputString.encode('utf-8') # creates a UTF-8 byte object
proceedings=str(proceedings) # creates string representation <- the source of your issue
file=open(filename,"w") # encodes str to platform specific encoding.
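To see why the escape sequences end up in the file, it can help to trace the lines above on a single character. This is a minimal sketch that uses an em dash (U+2014) as a stand-in for the text returned by readPDF; the variable names are illustrative only.

```python
# Stand-in for the text pdfminer returns: a single em dash (U+2014).
output_string = "\u2014"

# Step 1: encode() turns the str into a bytes object holding three UTF-8 bytes.
encoded = output_string.encode("utf-8")

# Step 2: str() on a bytes object does NOT decode it. It returns the repr,
# i.e. the literal characters b, ', \, x, e, 2, ... as they would be typed in source.
as_text = str(encoded)

print(len(output_string))  # 1 character
print(len(encoded))        # 3 bytes
print(as_text)             # b'\xe2\x80\x94'
print(len(as_text))        # 15 characters
```

Writing as_text to a file therefore stores fifteen ordinary ASCII characters rather than the one character that was intended.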

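Tip: use a with statement to set up the file context, so that the file is closed once the context finishes. Also, it is best not to use file as a variable name, because it is also a type. Replace the above code with: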

with open(filename, 'w', encoding="utf-8") as proceedings_file:
    # write the str returned by readPDF directly; open() handles the UTF-8 encoding
    proceedings_file.write(outputString)
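Putting the pieces together, the write step might look like the sketch below. The helper name save_text, the temporary path, and the sample string are hypothetical and stand in for the output of readPDF; the round trip at the end only checks that the characters survive.

```python
import os
import tempfile


def save_text(text, filename):
    """Write a str to disk as UTF-8; open() performs the encoding."""
    with open(filename, "w", encoding="utf-8") as out_file:
        out_file.write(text)


# Hypothetical sample standing in for the string returned by readPDF().
sample = "Proceedings \u2014 first session"

path = os.path.join(tempfile.mkdtemp(), "proceedings.txt")
save_text(sample, path)

# Read the file back with the same encoding to confirm nothing was escaped.
with open(path, encoding="utf-8") as in_file:
    round_tripped = in_file.read()

print(round_tripped == sample)
```

Passing encoding="utf-8" explicitly avoids depending on the platform's default encoding, which is what open() would otherwise use.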

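file is no longer a type in Python 3. str(proceedings) actually returns the string representation of the bytes object created on the previous line, e.g. "b'stuff'".

Hi @MarkTolonen. Yes, that is what I wrote: str(proceedings) creates a [Unicode] str [from bytes (by decoding)]. Right? Thanks for the tip about file.

Ah, I see what you mean. I thought str() decoded using an implied default encoding. So my answer should still resolve the OP's problem, albeit indirectly :$

Thanks. I'll give it a try!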
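The distinction discussed in these comments can be checked directly. As a small sketch: str() called with only a bytes argument returns its repr, while passing an encoding (or calling decode()) actually decodes the bytes.

```python
data = "\u2014".encode("utf-8")  # the three bytes e2 80 94

# No encoding given: str() falls back to repr(), producing escape sequences.
as_repr = str(data)

# Encoding given: str() decodes, equivalent to data.decode("utf-8").
decoded_via_str = str(data, encoding="utf-8")
decoded_via_method = data.decode("utf-8")

print(as_repr == repr(data))                  # True
print(decoded_via_str == decoded_via_method)  # True
print(decoded_via_str == "\u2014")            # True
```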