Error when decoding in Python: ANSI characters â€

Tags: python, python-3.x, encoding

I have a program in Python 3 that reads and compares files (files with the same name) from two folders, "gold" and "prediction".

But it raises the error below. My files are UTF-8, and the bytes that trigger the error are 0xE2 0x80 (which display as "â€" when the file is viewed as ANSI):

Traceback (most recent call last):
  File "C:\scienceie2017_train\test.py", line 215, in <module>
    calculateMeasures(folder_gold, folder_pred, remove_anno)
  File "C:\scienceie2017_train\test.py", line 34, in calculateMeasures
    res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
  File "C:\scienceie2017_train\test.py", line 132, in normaliseAnnotations
    for l in file_anno:
  File "C:\Users\chedi\Anaconda3\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 915-916: invalid continuation byte
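
For context, 0xE2 0x80 are the first two bytes of the UTF-8 en-dash (0xE2 0x80 0x93). A minimal sketch reproducing the same error, assuming the byte that follows them is not a valid continuation byte:

b"\xe2\x80A".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1:
# invalid continuation byte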
The code is:

#!/usr/bin/python
# by Matthew Peters, who spotted that sklearn does macro averaging not
# micro averaging correctly and changed it

import os
from sklearn.metrics import precision_recall_fscore_support
import sys


def calculateMeasures(folder_gold="data/dev/", folder_pred="data_pred/dev/", remove_anno=""):
    '''
    Calculate P, R, F1, Macro F
    :param folder_gold: folder containing gold standard .ann files
    :param folder_pred: folder containing prediction .ann files
    :param remove_anno: if set to "rel", relations will be ignored. Use this setting to only evaluate
    keyphrase boundary recognition and keyphrase classification. If set to "types", only keyphrase boundary recognition is evaluated.
    Note that for the latter, false positive
    :return:
    '''

    flist_gold = os.listdir(folder_gold)
    res_all_gold = []
    res_all_pred = []
    targets = []

    for f in flist_gold:
        # ignore non-.ann files, should there be any
        if not str(f).endswith(".ann"):
            continue
        f_gold = open(os.path.join(folder_gold, f), "r", encoding="utf8")
        try:
            f_pred = open(os.path.join(folder_pred, f), "r", encoding="utf8")
            res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
        except IOError:
            print(f + " file missing in " + folder_pred + ". Assuming no predictions are available for this file.")
            res_full_pred, res_pred, spans_pred, rels_pred = [], [], [], []

        res_full_gold, res_gold, spans_gold, rels_gold = normaliseAnnotations(f_gold, remove_anno)

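        # score over the union of spans seen in gold or prediction,
        # so every span is counted exactly once, either as its label or as "NONE"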
        spans_all = set(spans_gold + spans_pred)

        for i, r in enumerate(spans_all):
            if r in spans_gold:
                target = res_gold[spans_gold.index(r)].split(" ")[0]
                res_all_gold.append(target)
                if target not in targets:
                    targets.append(target)
            else:

                res_all_gold.append("NONE")

            if r in spans_pred:
                target_pred = res_pred[spans_pred.index(r)].split(" ")[0]
                res_all_pred.append(target_pred)
            else:

                res_all_pred.append("NONE")

        #y_true, y_pred, labels, targets
        prec, recall, f1, support = precision_recall_fscore_support(res_all_gold, res_all_pred, labels=targets, average=None)
        metrics = {}
        for k, target in enumerate(targets):
            metrics[target] = {
                'precision': prec[k],
                'recall': recall[k],
                'f1-score': f1[k],
                'support': support[k]
            }

        # now micro-averaged
        if remove_anno != 'types':
            prec, recall, f1, s = precision_recall_fscore_support(res_all_gold, res_all_pred, labels=targets, average='micro')
            metrics['overall'] = {
                'precision': prec,
                'recall': recall,
                'f1-score': f1,
                'support': sum(support)
            }
        else:
            # just binary classification, nothing to average
            metrics['overall'] = metrics['KEYPHRASE-NOTYPES']

    print_report(metrics, targets)
    return metrics

def print_report(metrics, targets, digits=2):
    def _get_line(results, target, columns):
        line = [target]
        for column in columns[:-1]:
            line.append("{0:0.{1}f}".format(results[column], digits))
        line.append("%s" % results[columns[-1]])
        return line

    columns = ['precision', 'recall', 'f1-score', 'support']

    fmt = '%11s' + '%9s' * 4 + '\n'
    report = [fmt % tuple([''] + columns)]
    report.append('\n')
    for target in targets:
        results = metrics[target]
        line = _get_line(results, target, columns)
        report.append(fmt % tuple(line))
    report.append('\n')

    # overall
    line = _get_line(metrics['overall'], 'avg / total', columns)
    report.append(fmt % tuple(line))
    report.append('\n')

    print(''.join(report))

def normaliseAnnotations(file_anno, remove_anno):
    '''
    Parse annotations from the annotation files: remove relations (if requested), convert rel IDs to entity spans
    :param file_anno:
    :param remove_anno:
    :return:
    '''
    res_full_anno = []
    res_anno = []
    spans_anno = []
    rels_anno = []

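    # each line of a .ann file is tab-separated: an ID ("T…", "R…" or "*"),
    # a "TYPE START END" field (or a relation), and the surface text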
    for l in file_anno:
        print(l)
        print(l.strip('\n'))
        r_g = l.strip('\n').split("\t")
        print(r_g)
        print(len(r_g))
        r_g_offs = r_g[1].split()
        print(r_g_offs)
        if remove_anno != "" and r_g_offs[0].endswith("-of"):
            continue

        res_full_anno.append(l.strip())

        if r_g_offs[0].endswith("-of"):
            arg1 = r_g_offs[1].replace("Arg1:", "")
            arg2 = r_g_offs[2].replace("Arg2:", "")
            for l in res_full_anno:
                r_g_tmp = l.strip().split("\t")
                if r_g_tmp[0] == arg1:
                    ent1 = r_g_tmp[1].replace(" ", "_")
                if r_g_tmp[0] == arg2:
                    ent2 = r_g_tmp[1].replace(" ", "_")

            spans_anno.append(" ".join([ent1, ent2]))
            res_anno.append(" ".join([r_g_offs[0], ent1, ent2]))
            rels_anno.append(" ".join([r_g_offs[0], ent1, ent2]))

        else:
            spans_anno.append(" ".join([r_g_offs[1], r_g_offs[2]]))
            keytype = r_g[1]
            if remove_anno == "types":
                keytype = "KEYPHRASE-NOTYPES"
            res_anno.append(keytype)

    for r in rels_anno:
        r_offs = r.split(" ")
        # reorder hyponyms to start with smallest index
        # 1, 2
        if r_offs[0] == "Synonym-of" and r_offs[2].split("_")[1] < r_offs[1].split("_")[1]:
            r = " ".join([r_offs[0], r_offs[2], r_offs[1]])
        if r_offs[0] == "Synonym-of":
            for r2 in rels_anno:
                r2_offs = r2.split(" ")
                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[1]:
                    r_new = " ".join([r2_offs[0], r_offs[2], r2_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new

                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[2]:
                    r_new = " ".join([r2_offs[0], r2_offs[1], r_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new

    rels_anno = list(set(rels_anno))

    res_full_anno_new = []
    res_anno_new = []
    spans_anno_new = []

    for r in res_full_anno:
        r_g = r.strip().split("\t")
        if r_g[0].startswith("R") or r_g[0] == "*":
            continue
        ind = res_full_anno.index(r)
        res_full_anno_new.append(r)
        res_anno_new.append(res_anno[ind])
        spans_anno_new.append(spans_anno[ind])

    for r in rels_anno:
        res_full_anno_new.append("R\t" + r)
        res_anno_new.append(r)
        spans_anno_new.append(" ".join([r.split(" ")[1], r.split(" ")[2]]))

    return res_full_anno_new, res_anno_new, spans_anno_new, rels_anno

if __name__ == '__main__':
    folder_gold = "data/dev/"
    folder_pred = "data_pred/dev/"
    remove_anno = ""  # "", "rel" or "types"
    if len(sys.argv) >= 2:
        folder_gold = sys.argv[1]
    if len(sys.argv) >= 3:
        folder_pred = sys.argv[2]
    if len(sys.argv) == 4:
        remove_anno = sys.argv[3]

    calculateMeasures(folder_gold, folder_pred, remove_anno)
Example gold file:

T1	Material 2 20	fluctuating vacuum
T2	Process 45 59	quantum fields
T3	Task 45 59	quantum fields
T4	Process 74 92	free Maxwell field
T5	Process 135 151	fermionic fields
T6	Process 195 222	undergo vacuum fluctuations
T7	Process 257 272	Casimir effect
T8	Task 396 411	nuclear physics
T9	Task 434 464	"MIT bag model" of the nucleon
T10	Task 518 577	a collection of fermionic fields describing confined quarks
T11	Process 732 804	the bag boundary condition modifies the vacuum fluctuations of the field
T12	Task 983 998	nuclear physics
T13	Material 1063 1080	bag model nucleon
T14	Material 507 514	nucleon
T15	Task 843 856	Casimir force
T16	Process 289 300	such fields
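
For reference, the fields in each line are tab-separated (ID, "TYPE START END", surface text), which is the structure normaliseAnnotations() relies on. A sketch using the first line above:

line = "T1\tMaterial 2 20\tfluctuating vacuum"
r_g = line.strip("\n").split("\t")   # ['T1', 'Material 2 20', 'fluctuating vacuum']
r_g_offs = r_g[1].split()            # ['Material', '2', '20']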
This is where the garbled characters come from:

"â€“".encode("cp1256").decode("utf8") == "–", a dash

The file you are opening appears to be encoded in UTF-8, but you did not tell open() which encoding to use when decoding it; simply add encoding="utf8" to its arguments.
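
In the script above that means passing the encoding explicitly on every open() call, e.g. for the gold file:

f_gold = open(os.path.join(folder_gold, f), "r", encoding="utf8")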

Otherwise Python uses the operating system's default character encoding, and you appear to be on Windows, where the default is never UTF-8. Run:

import locale
locale.getpreferredencoding()

to find out which encoding Python defaults to when reading and writing files.

Sorry, I forgot to include the change in the code; I had also tried encoding="utf8", but the solution I found was to use Latin encoding, encoding="latin-1"! Where should I add the encode/decode instruction? – You don't need to add the encode/decode snippet anywhere; I only wanted to show where that character sequence may come from (that is, you can get â€“ by encoding the dash character as UTF-8 and decoding it as cp1256). – It is not an ANSI character. I noticed that when I view the file as ANSI it shows â€“, and when I view it as UTF-8 it is 0xE2 0x80.
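
A small sketch contrasting the two workarounds discussed above (the byte string is a made-up example): latin-1 maps every byte to a code point, so it never raises an error, but it does not reassemble multi-byte UTF-8 characters:

raw = b"quantum \xe2\x80\x93 field"  # hypothetical line containing a UTF-8 en-dash
raw.decode("utf8")      # 'quantum – field'         (correct text)
raw.decode("latin-1")   # 'quantum â\x80\x93 field' (never raises, but mangles the dash)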