Python: UnicodeDecodeError when reading a CSV file containing Chinese characters

Tags: Python, Python 3.x, CSV, Unicode, Chinese locale

I am trying to index a Chinese CSV as documents in Elasticsearch. The data in the CSV starts with the following bytes:

b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'
The code looks like this:

import csv
import json
import pandas as pd
from elasticsearch import Elasticsearch
es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
from elasticsearch import helpers
import codecs

def csv_reader(file_name):
    es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
    with codecs.open(file_name, 'r', 'utf-8') as outfile:
        reader = csv.DictReader(outfile)
        helpers.bulk(es, reader, index="checklist", doc_type="quality")

if __name__ == "__main__":
    with open('checklist1.csv') as f_obj:
        csv_reader('checklist1.csv')
Then I get the following error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte

The file is not UTF-8 encoded, as the error makes clear. Opening the CSV in an editor suggested it might be latin2, which is obviously wrong, since latin2 does not cover Chinese characters. Sure enough, that encoding "works" (it raises no error) but produces gibberish:

Chapter,Content,Score
1.1.1,ŹO§_¤wĹçĂŇŤ~˝č¨t˛ÎŠŇťÝŞşŚUśľşŢ¨îŹyľ{ĄH,1
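One quick way to narrow the candidates down is to try decoding the raw bytes with each encoding and see which ones succeed. Here is a sketch using the sample bytes from the question:

```python
# Probe a few candidate encodings against the question's raw bytes.
sample = b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'

for enc in ('utf-8', 'latin2', 'big5'):
    try:
        # Show the first data line as decoded by this encoding.
        print(enc, '->', sample.decode(enc).split('\r\n')[1])
    except UnicodeDecodeError as exc:
        print(enc, '-> failed:', exc)
```

utf-8 fails outright, latin2 decodes silently to the mojibake shown above, and big5 yields readable Chinese.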
Try
big5
and
big5hkscs
(the latter adds the Hong Kong Supplementary Character Set); both were designed for Traditional Chinese. When
print
ed, the two give the same result:

Chapter,Content,Score
1.1.1,是否已驗證品質系統所需的各項管制流程?,1

Whether the text is meaningful is something only a Chinese reader can judge, but the fact that the conversion succeeds without errors is promising.
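As a quick sanity check, the two codecs can also be compared directly. This is a sketch using the sample bytes from the question; big5hkscs is essentially a superset of big5, so plain Big5 data such as this should decode identically under both:

```python
# Compare big5 and big5hkscs on the question's sample bytes.
sample = b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'

decoded_big5 = sample.decode('big5')
decoded_hkscs = sample.decode('big5hkscs')
print(decoded_big5 == decoded_hkscs)
print(decoded_big5.split('\r\n')[1])
```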

Here is a reproducible example. You must use the correct encoding. Note that
codecs
is a legacy module; in Python 3 the built-in
open
accepts an encoding directly:

import csv

s = b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'

# Create a file with your sample byte string
with open('checklist.csv','wb') as f:
    f.write(s)

# Open it with the correct encoding and newline requirements for using DictReader.
with open('checklist.csv',encoding='big5',newline='') as f:
    r = csv.DictReader(f)
    for line in r:
        print(line['Content'])
Output:

是否已驗證品質系統所需的各項管制流程?
品質管理系統是否符合ISO標準的要求?

Finally, fix your indentation and remove the unnecessary imports:
json
and
pandas
are never used.
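Putting it together, a corrected version of the question's reader might look like the sketch below. It stops short of indexing, since that needs a live cluster; the endpoint, index name, and doc_type are taken from the question, and the commented-out bulk call shows where they would go:

```python
import csv

# Recreate the sample file from the question's bytes (assumption: the real
# checklist1.csv is Big5-encoded, as the decoding experiment suggests).
s = b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'
with open('checklist1.csv', 'wb') as f:
    f.write(s)

def csv_rows(file_name):
    # encoding='big5' replaces the wrong 'utf-8'; newline='' is what the
    # csv module documentation recommends for files handed to its readers.
    with open(file_name, encoding='big5', newline='') as f:
        yield from csv.DictReader(f)

rows = list(csv_rows('checklist1.csv'))
print(len(rows))  # 2 data rows

# With a live cluster, the rows would be fed straight to the bulk helper,
# as in the question:
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
#   helpers.bulk(es, csv_rows('checklist1.csv'), index="checklist", doc_type="quality")
```

Because csv_rows is a generator, helpers.bulk can stream the rows without loading the whole file into memory.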