<class 'UnicodeDecodeError'>: this occurs only in Python 3, not in Python 2

python python-3.x


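I am working on a project that analyzes tweets for an urban policy course. The purpose of this script is to parse certain information out of JSON files that a colleague downloaded. Here is a link to a sample tweet I am trying to parse:

I had a friend test the following script on some version of Python 2 (Windows), and it worked. However, my machine (Windows 10) runs the latest version of Python 3, and the script does not work for me.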

import json
import collections
import sys, os
import glob
from datetime import datetime
import csv


def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

def to_ilan_csv(json_files):
    # write the column headers
    csv_writer = csv.writer(open("test.csv", "w"))
    headers = ["tweet_id", "handle", "username", "tweet_text", "has_image", "image_url", "created_at", "retweets", "hashtags", "mentions", "isRT", "isMT"]
    csv_writer.writerow(headers)

    # open the JSON files we stored and parse them into the CSV file we're working on
    try:
        #json_files = glob.glob(folder)
        print("Parsing %s files." % len(json_files))
        for file in json_files:
            f = open(file, 'r')
            if f != None:
                for line in f:
                    # hack to avoid the trailing \n at the end of the file - sitcking point LH 4/7/16
                    if len(line) > 3:
                        i = 0
                        tweets = convert(json.loads(line))
                        for tweet in tweets:
                            has_media = False
                            is_RT = False
                            is_MT = False
                            hashtags_list = []
                            mentions_list = []
                            media_list = []

                            entities = tweet["entities"]
                            # old tweets don't have key "media" so need a workaround
                            if entities.has_key("media"):
                                has_media = True
                                for item in entities["media"]:
                                    media_list.append(item["media_url"])

                            for hashtag in entities["hashtags"] :
                                hashtags_list.append(hashtag["text"])

                            for user in entities["user_mentions"]:
                                mentions_list.append(user["screen_name"])

                            if tweet["text"][:2] == "RT":
                                is_RT = True

                            if tweet["text"][:2] == "MT":
                                is_MT = True

                            values = [
                                tweet["id_str"],
                                tweet["user"]["id_str"],
                                tweet["user"]["screen_name"],
                                tweet["text"],
                                has_media,
                                ','.join(media_list) if len(media_list) > 0 else "",
                                datetime.strptime(tweet["created_at"], '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d %H:%M:%S'),
                                tweet["retweet_count"],
                                ','.join(hashtags_list) if len(hashtags_list) > 0 else "",
                                ','.join(mentions_list) if len(mentions_list) > 0 else "",
                                is_RT,
                                is_MT
                            ]
                            csv_writer.writerow(values)
                    else:
                        continue
            f.close()

    except:
        print("Something went wrong. Quitting.")
        for i in sys.exc_info():
            print(i)

def parse_tweets():
    
    file_names = []
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample1.json")
    file_names.append("C:\\Users\\Adam\\Downloads\\Test Code\\sample2.json")
    to_ilan_csv(file_names)

I then run it simply by executing

parse_tweets()

But I get the following error:

Parsing 2 files.
Something went wrong. Quitting.
<class 'UnicodeDecodeError'>
'charmap' codec can't decode byte 0x9d in position 3338: character maps to <undefined>
<traceback object at 0x0000016CCFEE5648>


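I asked a CS friend for help, but he could not diagnose the problem, so I came here.

My question: what is this error, and why does it appear only in Python 3 and not in Python 2?

For anyone who wants to try it, the code shown should run in a Jupyter notebook with copies of the files from the Dropbox link I provided.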

Sooo, after some debugging in chat, here is the solution:

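Apparently, the file the OP was using was not being read as UTF-8, so iterating over it (with for line in f) raised a UnicodeDecodeError from the cp1252 codec. In Python 3, open() in text mode decodes the file with the platform's default encoding, which on this Windows machine is evidently cp1252, whereas Python 2's open() returns undecoded bytes, so the error could not occur there. We fixed it by explicitly opening the file as UTF-8: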

f = open(file, 'r', encoding='utf-8')
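To make the failure concrete, here is a minimal sketch of what probably happened. It assumes the offending character was a right double quotation mark (U+201D), which is only a guess; its UTF-8 encoding ends in the byte 0x9d, which cp1252 does not define. The sample text and the file name are hypothetical.

```python
import os
import tempfile

# Assumption for illustration: a tweet containing curly quotes (U+201C, U+201D).
text = '{"text": "so-called \u201cfix\u201d"}\n'
raw = text.encode("utf-8")  # U+201D becomes the bytes e2 80 9d

# cp1252 has no character at 0x9d, so decoding the UTF-8 bytes fails.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# An explicit encoding avoids depending on the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)
    with open(path, "r", encoding="utf-8") as f:
        lines = [line for line in f]
```

Passing encoding='utf-8' at every open() call keeps the script's behavior the same regardless of the machine's locale settings.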

After we did that, the file opened correctly, and the OP ran into the Python 3 problems we had been expecting all along. The following three issues came up:

'dict' object has no attribute 'iteritems'

dict.iteritems() no longer exists in Python 3, so we simply switch to dict.items() here:

return {convert(key): convert(value) for key, value in input.items()}

name 'unicode' is not defined

unicode is no longer a separate type in Python 3; the regular str type already holds Unicode text, so we simply remove this case:

elif isinstance(input, unicode):
    return input.encode('utf-8')

'dict' object has no attribute 'has_key'

To check whether a key exists in a dictionary, we use the in operator, so the if check becomes:

if "media" in entities:

After that, the code should run fine in Python 3.
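Putting the three fixes together, the per-tweet logic could look roughly like the sketch below. This is not the OP's final script: tweet_to_row is a hypothetical helper name, the sample tweet is made up, and entities.get("media", []) is one possible alternative to the explicit in check.

```python
import csv
import io
from datetime import datetime


def tweet_to_row(tweet):
    """Build one CSV row from a tweet dict, using the same fields as the script."""
    entities = tweet["entities"]
    # .get() with a default covers old tweets that have no "media" key.
    media_urls = [item["media_url"] for item in entities.get("media", [])]
    hashtags = [tag["text"] for tag in entities["hashtags"]]
    mentions = [user["screen_name"] for user in entities["user_mentions"]]
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")
    return [
        tweet["id_str"],
        tweet["user"]["id_str"],
        tweet["user"]["screen_name"],
        tweet["text"],               # already a str in Python 3, no encode() needed
        "media" in entities,         # replaces entities.has_key("media")
        ",".join(media_urls),
        created.strftime("%Y-%m-%d %H:%M:%S"),
        tweet["retweet_count"],
        ",".join(hashtags),
        ",".join(mentions),
        tweet["text"][:2] == "RT",
        tweet["text"][:2] == "MT",
    ]


# Made-up sample tweet, for illustration only.
sample = {
    "id_str": "1",
    "user": {"id_str": "42", "screen_name": "example_user"},
    "text": "RT hello \u201cworld\u201d",
    "created_at": "Thu Apr 07 12:00:00 +0000 2016",
    "retweet_count": 3,
    "entities": {"hashtags": [{"text": "policy"}], "user_mentions": []},
}

row = tweet_to_row(sample)
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, the csv module documentation recommends opening it with newline='', and an explicit encoding='utf-8' avoids the same cp1252 default on output.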

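Why do you do the convert bit? csvwriter.writerow wants strings. There does not seem to be any reason to use bytes here.

Please remove the try…except part and let the program crash naturally. That way you get a proper stack trace that tells you exactly where the error occurred.

@Ryan Why is that relevant? Backstory: I got this code from a professor, who handed it over to me. Since I am inexperienced, I did not really question the structure. But it works in Python 2. Do you think this part might be one of the reasons it fails in Python 3?

@poke OK, I will give that a try.

For me, this code does not work in Python 3 at all. I had to change three parts of it because it uses things that no longer exist in Python 3, so I do not know what you are doing there. That said, after I fixed those, the program ran without errors. Are you sure you are using Python 3 here?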
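On the try…except point raised in the comments: if removing the handler entirely is not an option, a minimal sketch like the following keeps the full stack trace. The function names parse_line and parse_with_traceback are hypothetical; traceback.format_exc() is from the standard library.

```python
import json
import traceback


def parse_line(line):
    # json.loads raises json.JSONDecodeError on malformed input.
    return json.loads(line)


def parse_with_traceback(line):
    """Return (result, traceback_text); traceback_text is None on success."""
    try:
        return parse_line(line), None
    except Exception:
        # format_exc() includes the file name and line number of the failure,
        # which printing the items of sys.exc_info() one by one does not show.
        return None, traceback.format_exc()


result, tb_text = parse_with_traceback("{not valid json")
print(tb_text)
```

Catching Exception rather than using a bare except also lets KeyboardInterrupt pass through, so the script can still be stopped with Ctrl+C.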