Python IndexError since I've changed the way I read the file
Tags: python, csv, utf-16, index-error
I'm trying to read and reformat a very large (2GB+) .out file that is structured like a CSV. I previously used the standard open() and never hit this problem, but I switched to codecs.open() because open() had trouble with certain characters.
It is now throwing:
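Traceback (most recent call last):
  line 21, in <module>
    if (r[5] == ''):
IndexError: list index out of range
This happens on the very first line of the file, even though there is definitely an element at r[5].
(Finished in 0.301s)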
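import sys
import csv
import datetime
import codecs

# Raise the csv field size limit as far as the platform allows,
# shrinking the value until csv.field_size_limit() accepts it.
maxInt = sys.maxsize
decrement = True
while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt / 10)
        decrement = True

# Read the source as UTF-16 (big-endian) and write selected columns to out.csv.
with codecs.open("file.out", "rU", "utf-16-be") as source:
    rdr = csv.reader(source)
    with open("out.csv", "w", newline="") as result:
        wtr = csv.writer(result)
        wtr.writerow(("Column 1", "Column 2", "Column 3", "etc"))
        for r in rdr:
            # Skip rows that have no date in the sixth field.
            if (r[5] == ""):
                continue
            wtr.writerow((datetime.datetime.strptime(r[5], "%m/%d/%Y").strftime("%Y-%m-%d"), r[3], r[7], r[9] + r[10] + " " + r[12]))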
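One way the script above could hit this IndexError on the very first row, assuming (the post does not confirm it) that file.out is really single-byte text like the sample rows shown further down: decoding as utf-16-be fuses every two bytes into one character, so the commas and quotes disappear and csv.reader sees a single field. A minimal sketch of that effect; the function name fields_in_first_row and the sample bytes are illustrative only.

```python
import csv


def fields_in_first_row(raw: bytes, encoding: str) -> int:
    """Decode raw bytes with the given encoding and count the CSV fields found."""
    text = raw.decode(encoding, errors="replace")
    first_row = next(csv.reader([text]))
    return len(first_row)


# Illustrative sample: the first six fields of a row, stored as plain ASCII bytes.
raw = b'"A00017","K","G","1999","4530","01/12/1999"\n'

as_latin1 = fields_in_first_row(raw, "latin-1")    # one byte per character
as_utf16 = fields_in_first_row(raw, "utf-16-be")   # two bytes per character
print(as_latin1, as_utf16)
```

With the single-byte decoding the row splits into its fields; with utf-16-be no delimiter survives, so r[5] does not exist.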
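Using utf-8 throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 12: invalid continuation byte
Using latin-1 or ISO-8859-1 throws UnicodeEncodeError: 'charmap' codec can't encode characters in position 57-58: character maps to <undefined>, although it runs for longer before failing.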
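The two errors above come from different steps. A UnicodeDecodeError is raised while reading bytes; a UnicodeEncodeError is raised while writing or printing text with a codec that lacks some characters (possibly a platform default such as cp1252, which is an assumption here). A rough first check on the input side is to look at the first bytes of the file for a byte order mark. The sketch below is only a heuristic: the helper name guess_encoding_from_bom is hypothetical, and the latin-1 fallback is an assumption chosen because it can decode any byte sequence, not because it is known to be the file's real encoding.

```python
import codecs

# UTF-32 marks are tested before UTF-16 marks because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head: bytes, default: str = "latin-1") -> str:
    """Return a codec name based on a leading BOM, or the default if none is found."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def read_head(path: str, size: int = 64) -> bytes:
    """Read the first few raw bytes of a file without decoding them."""
    with open(path, "rb") as f:
        return f.read(size)
```

A call such as guess_encoding_from_bom(read_head("file.out")) gives a starting point; a file without a BOM still needs to be checked by eye.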
The input file looks like this:
"A00017","K","G","1999","4530","01/12/1999","","","","PEOPLE TO ELECT MANGINELLI","","","","258 MAGNIOLIA DRIVE","SELDEN","NY","11784","","","404.57","","","","","","","2","","NAA","07/22/1999 08:43:59"
"A00037","K","G","1999","999999","01/12/1999","","","","CITIZENS TO ELECT TEDISCO TO ASSEMBLY","","","","","","","","","","0","","","","","","","2","","",""
"A00037","K","N","1999","1693","01/15/1999","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND AVE","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","PREVIOUS LOAN FROM JAMES TEDISCO","","P","JM","07/15/1999 15:08:17"
"A00037","J","N","2000","1694","01/13/2000","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","LOANS FROM PREVIOUS CAMPAIGNS FROM J","","P","JM","01/14/1900 16:35:09"
"A00037","K","X","2000","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/20/2000 00:00:00"
"A00037","J","X","2001","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/17/2001 00:00:00"
"A00037","K","X","2002","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/19/2002 00:00:00"
"A00037","J","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/21/2003 00:00:00"
"A00037","K","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/16/2003 00:00:00"
"A00037","J","X","2004","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/22/2004 00:00:00"
I got this far thanks to:
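In the "file.out" you are reading, find out which character separates the cell elements in each row. It will be something like '\t' (tab) or ',' (comma). Pass it to csv.reader through the "delimiter" argument. Try printing "r" and look at the characters between the column names or values in the row.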
rdr = csv.reader(source,delimiter=<separator>)
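Instead of reading the separator off a printed row, the standard library's csv.Sniffer can guess it from a sample of already-decoded text. This is a sketch, not part of the original answer: the helper name detect_delimiter and the candidate list are assumptions, and the sniffer is a heuristic that can fail on unusual data.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",\t;|") -> str:
    """Guess the field delimiter from a text sample, limited to the given candidates."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Illustrative sample modelled on the first fields of the rows in the question.
sample = (
    '"A00017","K","G","1999","4530","01/12/1999"\n'
    '"A00037","K","G","1999","999999","01/12/1999"\n'
    '"A00037","K","N","1999","1693","01/15/1999"\n'
)
delimiter = detect_delimiter(sample)
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
```

If the text was decoded with the wrong encoding, the sniffer has no real delimiter to find, so this check only helps once the encoding is right.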
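Can you print r inside the loop?
Without seeing a file that reproduces the problem, it is hard for us to help you.
Have you tried printing "r" to check whether it is an array?
I tried printing r to the console and got UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-97: character maps to <undefined>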
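The print itself fails here because the console cannot encode some of the characters in the row. One way around that while debugging is to print an ASCII-only representation with the built-in ascii(), together with the number of fields. A small sketch; the helper name describe_rows is hypothetical.

```python
import csv


def describe_rows(lines, limit=3):
    """Return ASCII-only summaries of the first few parsed rows."""
    summaries = []
    for i, row in enumerate(csv.reader(lines)):
        if i >= limit:
            break
        # ascii() escapes every non-ASCII character, so any console can show it.
        summaries.append(f"row {i}: {len(row)} fields: {ascii(row)}")
    return summaries


for summary in describe_rows(['"\u00c9","b"\n']):
    print(summary)
```

A first row reported with a single field and a long run of escaped characters is a strong hint that the encoding passed to the reader is wrong.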
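Use 'utf-8' in codecs.open("file.out", "rU", "utf-8").
Could you try the 'latin-1' or 'ISO-8859-1' encoding instead of 'utf-8'?
Could you try the following code: rdr = csv.reader((line.replace('\0','') for line in source), delimiter=',')
Can you share the line in the data that causes the error?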
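The NUL-stripping suggestion can be combined with a length check so that a short row is skipped instead of raising IndexError. The sketch below is one possible arrangement, not the accepted fix: the generator name clean_rows and the min_fields default are assumptions, and the selected columns simply mirror the ones used in the question's script.

```python
import csv
import datetime


def clean_rows(source, min_fields=13):
    """Yield reformatted rows, skipping rows that are too short or have no date."""
    # Drop NUL characters before the csv module sees the line.
    stripped = (line.replace("\0", "") for line in source)
    for r in csv.reader(stripped, delimiter=","):
        if len(r) < min_fields or r[5] == "":
            continue
        date = datetime.datetime.strptime(r[5], "%m/%d/%Y")
        yield (date.strftime("%Y-%m-%d"), r[3], r[7], r[9] + r[10] + " " + r[12])


# Illustrative input: one good row containing a NUL, one short row, one undated row.
lines = [
    "A\x00,K,G,1999,4530,01/12/1999,,x,,NAME,,,Z\n",
    "short,row\n",
    "A,K,X,2000,999999" + "," * 8 + "\n",
]
cleaned = list(clean_rows(lines))
```

The result can be passed to wtr.writerows(clean_rows(source)). Opening the output with open("out.csv", "w", newline="", encoding="utf-8") makes the write side independent of the platform's default codec, which may be where the 'charmap' error comes from.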