Python parse.unquote_plus TypeError

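I'm trying to format a file so that it can be inserted into a database. The file is originally gzip-compressed and about 1.3 MB in size. Each line looks like this: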

398,%7EAnoniem+001%7E,543,480,7525010,1775,0

This is the code that parses the file:

Village = gzip.open(Root+'\\data'+'\\' +str(Newest_Date[0])+'\\' +str(Newest_Date[1])+'\\' +str(Newest_Date[2])\
            +'\\'+str(Newest_Date[3])+' village.gz');
Village_Parsed = str
for line in Village:
    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
print(Village.readline());

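When I run the program, I get the following error:

  File "C:\Python31\lib\urllib\parse.py", line 404, in unquote_plus
    string = string.replace('+', ' ')
TypeError: expected an object with the buffer interface

Any idea what's going wrong here?

Thanks in advance for your help :)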

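Problem 1 is that urllib.parse.unquote_plus doesn't like the line that you fed it. The message should really be "Please supply a str object" :-) I suggest that you fix problem 2 below and insert: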
print('line', type(line), repr(line))

immediately after your for statement, so that you can see what is in line.
You will find that iterating over the file returns bytes objects:
>>> [line for line in gzip.open('test.gz')]
[b'nudge nudge\n', b'wink wink\n']

Opening with mode 'r' makes no difference:
>>> [line for line in gzip.open('test.gz', 'r')]
[b'nudge nudge\n', b'wink wink\n']

I suggest that instead of passing line to the parsing routine, you pass line.decode('UTF-8'), or whatever encoding was used when the gz file was written.
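The decode-then-unquote step can be sketched as follows. This is a minimal illustration, not the asker's actual program: the sample line and the in-memory archive are stand-ins for the real village.gz, and 'ascii' is an assumed encoding. The second variant relies on text mode ('rt') in gzip.open, which is available in Python 3.3 and later, so it would not have worked on the asker's Python 3.1.

```python
import gzip
import io
import urllib.parse

# Hypothetical sample data standing in for the real archive.
raw = b"398,%7EAnoniem+001%7E,543,480,7525010,1775,0\n"
compressed = gzip.compress(raw)

# Binary mode (the default): each line is a bytes object,
# so decode it to str before handing it to unquote_plus.
with gzip.open(io.BytesIO(compressed)) as fh:
    decoded = [urllib.parse.unquote_plus(line.decode("ascii")) for line in fh]

# Text mode (Python 3.3+): gzip decodes for us and yields str lines.
with gzip.open(io.BytesIO(compressed), "rt", encoding="ascii") as fh:
    text_mode = [urllib.parse.unquote_plus(line) for line in fh]

print(decoded)
```

Both variants produce the same list of str lines; the only difference is where the bytes-to-str conversion happens.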
Problem 2 is in this line:
Village_Parsed = str

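str is a type. You need an empty str object. To get one, you could call the type, i.e. str(), which is formally correct but impractical, unusual, and weird compared with using the string constant ''. Do this: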
Village_Parsed = ''
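To make the difference concrete, here is a small sketch with made-up data (the names pieces, broken, result, and joined are illustrative only). It shows that concatenating onto the type fails, while starting from an empty string, or using ''.join, works.

```python
pieces = ["a", "b", "c"]  # illustrative data only

# `str` is the type itself, not a string, so `str + "a"` raises TypeError.
try:
    broken = str
    broken = broken + pieces[0]
except TypeError as exc:
    error_name = type(exc).__name__

# Starting from an empty string works as intended.
result = ""
for piece in pieces:
    result = result + piece

# "".join(...) builds the same string without repeated concatenation.
joined = "".join(pieces)
```

For a file with many lines, ''.join over a generator is usually the more idiomatic choice than repeated concatenation.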

You also have problem 3: your last statement attempts to read from the gz file after EOF.
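A quick way to see what reading after EOF does, using a small in-memory archive as a stand-in for the real file: once the for loop has consumed every line, readline() returns an empty bytes object rather than raising. Rewinding with seek(0) is one possible way to read from the start again, assuming that is what the final print was meant to do.

```python
import gzip
import io

# Hypothetical two-line archive built in memory.
compressed = gzip.compress(b"nudge nudge\nwink wink\n")

with gzip.open(io.BytesIO(compressed)) as fh:
    lines = [line for line in fh]   # consumes the whole file
    after_eof = fh.readline()       # nothing left: returns b''
    fh.seek(0)                      # rewind to the beginning
    first_again = fh.readline()     # the first line is readable again
```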
import gzip, os, urllib.parse

archive_relpath = os.sep.join(map(str, Newest_Date[:4])) + ' village.gz'  
archive_path = os.path.join(Root, 'data', archive_relpath)

with gzip.open(archive_path) as Village:
    Village_Parsed = ''.join(urllib.parse.unquote_plus(line.decode('ascii'))
                             for line in Village)
    print(Village_Parsed)

Output:
398,~Anoniem 001~,543,480,7525010,1775,0
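Note: RFC 3986 says:
This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.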
Therefore, 'ascii' in the line.decode('ascii') fragment should be replaced with whatever character encoding you used to encode the text.
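A related point, sketched here with a made-up value: unquote_plus has its own encoding parameter (UTF-8 by default) that controls how the percent-decoded octets are turned into characters. The raw line itself is pure ASCII whenever every non-ASCII byte is percent-escaped, so the choice that changes the result is often this one rather than the argument to line.decode.

```python
import urllib.parse

# Hypothetical value: "café au lait" with the é percent-encoded as UTF-8.
quoted = "caf%C3%A9+au+lait"

# Default: percent-decoded octets are interpreted as UTF-8.
as_utf8 = urllib.parse.unquote_plus(quoted)

# Interpreting the same octets as Latin-1 gives mojibake.
as_latin1 = urllib.parse.unquote_plus(quoted, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```

If the names in the file come out garbled, trying a different encoding argument here is a reasonable first check.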
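@JFSebastian: Did you actually try that? I get exactly the same error as the OP... Apart from his initialization problem, your code appears to be functionally identical to his, given that bytes objects are returned.
@John Machin: I've tried it (now). I couldn't find an unquote_plus that works from bytes, so we have to resort to an explicit bytes.decode method.
Thanks, your solution works perfectly, and thanks for pointing out my other errors (Machin and Sebastian). I'm not sure whether ascii is the character encoding that was used, but as far as I can tell it works without any problems.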
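For anyone who wants to stay in bytes, one possible workaround is to combine urllib.parse.unquote_to_bytes, which does accept bytes, with a manual replacement of '+'. The helper name unquote_plus_bytes below is hypothetical and not part of urllib; this is a sketch, not a drop-in replacement.

```python
import urllib.parse


def unquote_plus_bytes(line: bytes) -> bytes:
    """Bytes-only counterpart of unquote_plus (hypothetical helper)."""
    # Replace '+' before unquoting so that an escaped %2B stays a literal '+'.
    return urllib.parse.unquote_to_bytes(line.replace(b"+", b" "))


sample = unquote_plus_bytes(b"398,%7EAnoniem+001%7E,543")
escaped_plus = unquote_plus_bytes(b"a%2Bb+c")
```

The result is still bytes, so a decode call is needed at some point before the text is treated as str.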